Preface


Probability  and  statistics  are  fascinating  subjects  on  the  interface  between 
mathematics  and  applied  sciences  that  help  us  understand  and  solve  practical 
problems.  We  believe  that  you,  by  learning  how  stochastic  methods  come 
about  and  why  they  work,  will  be  able  to  understand  the  meaning  of  statistical 
statements  as  well  as  judge  the  quality  of  their  content,  when  facing  such 
problems  on  your  own.  Our  philosophy  is  one  of  how  and  why:  instead  of  just 
presenting  stochastic  methods  as  cookbook  recipes,  we  prefer  to  explain  the 
principles  behind  them. 

In  this  book  you  will  find  the  basics  of  probability  theory  and  statistics.  In 
addition,  there  are  several  topics  that  go  somewhat  beyond  the  basics  but 
that  ought  to  be  present  in  an  introductory  course:  simulation,  the  Poisson 
process,  the  law  of  large  numbers,  and  the  central  limit  theorem.  Computers 
have  brought  many  changes  in  statistics.  In  particular,  the  bootstrap  has 
earned  its  place.  It  provides  the  possibility  to  derive  confidence  intervals  and 
perform  tests  of  hypotheses  where  traditional  (normal  approximation  or  large 
sample)  methods  are  inappropriate.  It  is  a  modern  useful  tool  one  should  learn 
about,  we  believe. 

Examples  and  datasets  in  this  book  are  mostly  from  real-life  situations,  at 
least  that  is  what  we  looked  for  in  illustrations  of  the  material.  Anybody  who 
has  inspected  datasets  with  the  purpose  of  using  them  as  elementary  examples 
knows  that  this  is  hard:  on  the  one  hand,  you  do  not  want  to  boldly  state 
assumptions  that  are  clearly  not  satisfied;  on  the  other  hand,  long  explanations 
concerning  side  issues  distract  from  the  main  points.  We  hope  that  we  found 
a  good  middle  way. 

A first course in calculus is needed as a prerequisite for this book. In addition
to high-school algebra, some infinite series are used (exponential, geometric).
Integration and differentiation are the most important skills, mainly concerning
one variable (the exceptions, two-dimensional integrals, are encountered in
Chapters 9-11). Although the mathematics is kept to a minimum, we strove
to be mathematically correct throughout the book. With respect to probability
and statistics the book is self-contained.

The book is aimed at undergraduate engineering students, and students from
more business-oriented studies (who may gloss over some of the more
mathematically oriented parts). At our own university we also use it for students in
applied mathematics (where we put a little more emphasis on the math and
add topics like combinatorics, conditional expectations, and generating
functions). It is designed for a one-semester course: on average two hours in class
per chapter, the first for a lecture, the second doing exercises. The material
is also well-suited for self-study, as we know from experience.

We have divided attention about evenly between probability and statistics.
The very first chapter is a sampler with differently flavored introductory
examples, ranging from scientific success stories to a controversial puzzle. Topics
that follow are elementary probability theory, simulation, joint distributions,
the law of large numbers, the central limit theorem, statistical modeling
(informal: why and how we can draw inference from data), data analysis, the
bootstrap, estimation, simple linear regression, confidence intervals, and
hypothesis testing. Instead of a few chapters with a long list of discrete and
continuous distributions, with an enumeration of the important attributes of
each, we introduce a few distributions when presenting the concepts and the
others where they arise (more) naturally. A list of distributions and their
characteristics is found in Appendix A.

With  the  exception  of  the  first  one,  chapters  in  this  book  consist  of  three  main 
parts.  First,  about  four  sections  discussing  new  material,  interspersed  with  a 
handful  of  so-called  Quick  exercises.  Working  these — two-or-three-minute — 
exercises  should  help  to  master  the  material  and  provide  a  break  from  reading 
to  do  something  more  active.  On  about  two  dozen  occasions  you  will  find 
indented  paragraphs  labeled  Remark,  where  we  felt  the  need  to  discuss  more 
mathematical  details  or  background  material.  These  remarks  can  be  skipped 
without  loss  of  continuity;  in  most  cases  they  require  a  bit  more  mathematical 
maturity.  Whenever  persons  are  introduced  in  examples  we  have  determined 
their  sex  by  looking  at  the  chapter  number  and  applying  the  rule  “He  is  odd, 
she  is  even.”  Solutions  to  the  quick  exercises  are  found  in  the  second  to  last 
section  of  each  chapter. 

The  last  section  of  each  chapter  is  devoted  to  exercises,  on  average  thirteen 
per  chapter.  For  about  half  of  the  exercises,  answers  are  given  in  Appendix  C, 
and  for  half  of  these,  full  solutions  in  Appendix  D.  Exercises  with  both  a 
short  answer  and  a  full  solution  are  marked  with  ffl  and  those  with  only  a 
short  answer  are  marked  with  □  (when  more  appropriate,  for  example,  in 
“Show  that  ...”  exercises,  the  short  answer  provides  a  hint  to  the  key  step) . 
Typically,  the  section  starts  with  some  easy  exercises  and  the  order  of  the 
material  in  the  chapter  is  more  or  less  respected.  More  challenging  exercises 
are  found  at  the  end. 




Much of the material in this book would benefit from illustration with a
computer using statistical software. A complete course should also involve
computer exercises. Topics like simulation, the law of large numbers, the
central limit theorem, and the bootstrap loudly call for this kind of experience.
For this purpose, all the datasets discussed in the book are available at
http://www.springeronline.com/1-85233-896-2. The same Web site also provides
access, for instructors, to a complete set of solutions to the exercises;
go to the Springer online catalog or contact textbooks@springer-sbm.com to
apply for your password.

Delft, The Netherlands                          F. M. Dekking
January 2005                                    C. Kraaikamp
                                                H. P. Lopuhaä
                                                L. E. Meester


1  Why probability and statistics?


Is  everything  on  this  planet  determined  by  randomness?  This  question  is  open 
to  philosophical  debate.  What  is  certain  is  that  every  day  thousands  and 
thousands  of  engineers,  scientists,  business  persons,  manufacturers,  and  others 
are  using  tools  from  probability  and  statistics. 

The  theory  and  practice  of  probability  and  statistics  were  developed  during 
the  last  century  and  are  still  actively  being  refined  and  extended.  In  this  book 
we  will  introduce  the  basic  notions  and  ideas,  and  in  this  first  chapter  we 
present  a  diverse  collection  of  examples  where  randomness  plays  a  role. 


1.1  Biometry:  iris  recognition 

Biometry is the art of identifying a person on the basis of his or her personal
biological characteristics, such as fingerprints or voice. From recent research
it appears that with the human iris one can beat all existing automatic human
identification systems. Iris recognition technology is based on the visible
qualities of the iris. It converts these, via a video camera, into an “iris code”
consisting of just 2048 bits. This is done in such a way that the code is hardly
sensitive to the size of the iris or the size of the pupil. However, at different
times and different places the iris code of the same person will not be exactly
the same. Thus one has to allow for a certain percentage of mismatching bits
when identifying a person. In fact, the system allows about 34% mismatches!
How can this lead to a reliable identification system? The miracle is that
different persons have very different irides. In particular, over a large collection
of different irides the code bits take the values 0 and 1 about half of the time.
But that is certainly not sufficient: if one bit would determine the other 2047,
then we could only distinguish two persons. In other words, single bits may
be random, but the correlation between bits is also crucial (we will discuss
correlation at length in Chapter 10). John Daugman, who developed the
iris recognition technology, made comparisons between 222 743 pairs of iris
codes and concluded that of the 2048 bits 266 may be considered as
uncorrelated ([6]). He then argues that we may consider an iris code as the result
of 266 coin tosses with a fair coin. This implies that if we compare two such
codes from different persons, then there is an astronomically small probability
that these two differ in fewer than 34% of the bits; almost all pairs will differ
in about 50% of the bits. This is illustrated in Figure 1.1, which originates
from [6], and was kindly provided by John Daugman. The iris code data
consist of numbers between 0 and 1, each a Hamming distance (the fraction of
mismatches) between two iris codes. The data have been summarized in two
histograms, that is, two graphs that show the number of counts of Hamming
distances falling in a certain interval. We will encounter histograms and other
summaries of data in Chapter 15. One sees from the figure that for codes from
the same iris (left side) the mismatch fraction is only about 0.09, while for
different irides (right side) it is about 0.46.
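The fair-coin argument above can be checked numerically. The following is a minimal sketch, assuming the model stated in the text (266 independent fair coin tosses per comparison); the function name `prob_fraction_below` is our own and purely illustrative. It sums the binomial probabilities of all mismatch counts that stay strictly below 34% of the bits.

```python
from math import comb


def prob_fraction_below(n_bits: int, threshold: float) -> float:
    """P(mismatch fraction < threshold) when each of n_bits differs with probability 1/2."""
    # Largest mismatch count that is strictly below threshold * n_bits.
    limit = int(threshold * n_bits)
    if limit == threshold * n_bits:
        limit -= 1
    # Each outcome of n_bits fair tosses has probability 2**(-n_bits).
    favourable = sum(comb(n_bits, k) for k in range(limit + 1))
    return favourable / 2 ** n_bits


# Probability that two codes from different persons differ in fewer than 34% of 266 bits.
p_false_match = prob_fraction_below(266, 0.34)
print(p_false_match)
```

The result is a very small number, which is the sense in which a 34% mismatch allowance can still separate different persons under this model.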


[Figure 1.1: two histograms of Hamming distances, titled "Decision environment
for iris recognition": 222,743 comparisons of different iris pairs and 546
comparisons of same iris pairs. Annotations: mean = 0.456, stnd dev = 0.018,
d' = 11.36; theoretical curves: binomial family; theoretical cross-over point:
HD = 0.342; theoretical cross-over rate: 1 in 1.2 million.]

Fig. 1.1. Comparison of same and different iris pairs.

Source: J. Daugman. Second IMA Conference on Image Processing: Mathematical
Methods, Algorithms and Applications, 2000. © Ellis Horwood Publishing
Limited.


You may still wonder how it is possible that irides distinguish people so well.
What about twins, for instance? The surprising thing is that although the
color of eyes is hereditary, many features of iris patterns seem to be produced
by so-called epigenetic events. This means that during embryo development
the iris structure develops randomly. In particular, the iris patterns of
(monozygotic) twins are as discrepant as those of two arbitrary individuals.
For this reason, as early as in the 1930s, eye specialists proposed that iris
patterns might be used for identification purposes.


1.2  Killer  football 

A couple of years ago the prestigious British Medical Journal published a
paper with the title “Cardiovascular mortality in Dutch men during 1996
European football championship: longitudinal population study” ([41]). The
authors claim to have shown that the effect of a single football match is
detectable in national mortality data. They consider the mortality from
infarctions (heart attacks) and strokes, and the “explanation” of the increase is
a combination of heavy alcohol consumption and stress caused by watching
the football match on June 22 between the Netherlands and France (lost by
the Dutch team!). The authors mainly support their claim with a figure like
Figure 1.2, which shows the number of deaths from the causes mentioned (for
men over 45), during the period June 17 to June 27, 1996. The middle horizontal
line marks the average number of deaths on these days, and the upper and
lower horizontal lines mark what the authors call the 95% confidence interval.
The construction of such an interval is usually performed with standard
statistical techniques, which you will learn in Chapter 23. The interpretation
of such an interval is rather tricky. That the bar on June 22 sticks out of the
confidence interval should support the “killer claim.”



Fig.  1.2.  Number  of  deaths  from  infarction  or  stroke  in  (part  of)  June  1996. 


It is rather surprising that such a conclusion is based on a single football
match, and one could wonder why no probability model is proposed in the
paper. In fact, as we shall see in Chapter 12, it would not be a bad idea to
model the time points at which deaths occur as a so-called Poisson process.
Once we have done this, we can compute how often a pattern like the one in the
figure might occur, without paying attention to football matches and other
high-risk national events. To do this we need the mean number of deaths per
day. This number can be obtained from the data by an estimation procedure
(the subject of Chapters 19 to 23). We use the sample mean, which is equal to
(10 · 27.2 + 41)/11 = 313/11 = 28.45. (Here we have to make a computation
like this because we only use the data in the paper: 27.2 is the average over
the 5 days preceding and following the match, and 41 is the number of deaths
on the day of the match.) Now let $p_{\text{high}}$ be the probability that there are
41 or more deaths on a day, and let $p_{\text{usual}}$ be the probability that there are
between 21 and 34 deaths on a day; here 21 and 34 are the lowest and the
highest number that fall in the interval in Figure 1.2. From the formula of the
Poisson distribution given in Chapter 12 one can compute that $p_{\text{high}} = 0.008$
and $p_{\text{usual}} = 0.820$. Since events on different days are independent according
to the Poisson process model, the probability p of a pattern as in the figure is

$$p = p_{\text{usual}}^{5} \cdot p_{\text{high}} \cdot p_{\text{usual}}^{5} = 0.0011.$$

From  this  it  can  be  shown  by  (a  generalization  of)  the  law  of  large  numbers 
(which  we  will  study  in  Chapter  13)  that  such  a  pattern  would  appear  about 
once  every  1/0.0011  =  899  days.  So  it  is  not  overwhelmingly  exceptional  to 
find  such  a  pattern,  and  the  fact  that  there  was  an  important  football  match 
on  the  day  in  the  middle  of  the  pattern  might  just  have  been  a  coincidence. 
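The arithmetic of this argument can be retraced in a few lines. This is a minimal sketch that takes the rounded probabilities quoted in the text (0.008 and 0.820) as given rather than recomputing them from the Poisson formula, so its last digits may differ slightly from figures obtained with unrounded values. The variable names are ours.

```python
# Sample mean of the daily number of deaths over the 11 days:
# 10 ordinary days averaging 27.2, plus 41 deaths on the day of the match.
mean_deaths = (10 * 27.2 + 41) / 11

# Rounded probabilities as quoted in the text.
p_high = 0.008   # 41 or more deaths on a day
p_usual = 0.820  # between 21 and 34 deaths on a day

# Days are independent under the Poisson process model, so the probability of
# 5 usual days, one high day, and 5 more usual days is a product.
p_pattern = p_usual ** 5 * p_high * p_usual ** 5

# Long-run average number of days between occurrences of such a pattern.
days_between = 1 / p_pattern

print(round(mean_deaths, 2), round(p_pattern, 4), round(days_between))
```

With these rounded inputs the waiting time comes out at roughly 900 days, the same order of magnitude as the figure in the text.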


1.3  Cars  and  goats:  the  Monty  Hall  dilemma 

On  Sunday  September  9,  1990,  the  following  question  appeared  in  the  “Ask 
Marilyn”  column  in  Parade,  a  Sunday  supplement  to  many  newspapers  across 
the  United  States: 

Suppose  you’re  on  a  game  show,  and  you’re  given  the  choice  of  three 
doors;  behind  one  door  is  a  car;  behind  the  others,  goats.  You  pick  a 
door,  say  No.  1,  and  the  host,  who  knows  what’s  behind  the  doors, 
opens  another  door,  say  No.  3,  which  has  a  goat.  He  then  says  to  you, 

“Do  you  want  to  pick  door  No.  2?”  Is  it  to  your  advantage  to  switch 
your  choice? — Craig  F.  Whitaker,  Columbia,  Md. 

Marilyn’s answer (one should switch) caused an avalanche of reactions, in
total an estimated 10 000. Some of these reactions were not so flattering (“You
are the goat”), quite a lot were by professional mathematicians (“You blew
it, and blew it big,” “You are utterly incorrect .... How many irate
mathematicians are needed to change your mind?”). Perhaps some of the reactions
were so strong, because Marilyn vos Savant, the author of the column, is in
the Guinness Book of Records for having one of the highest IQs in the world.
The switching question was inspired by Monty Hall’s “Let’s Make a Deal”
game show, which ran with small interruptions for 23 years on various U.S.
television networks.

Although  it  is  not  explicitly  stated  in  the  question,  the  game  show  host  will 
always  open  a  door  with  a  goat  after  you  make  your  initial  choice.  Many 
people  would  argue  that  in  this  situation  it  does  not  matter  whether  one 
would  change  or  not:  one  door  has  a  car  behind  it,  the  other  a  goat,  so  the 
odds  to  get  the  car  are  fifty-fifty.  To  see  why  they  are  wrong,  consider  the 
following  argument.  In  the  original  situation  two  of  the  three  doors  have  a 
goat  behind  them,  so  with  probability  2/3  your  initial  choice  was  wrong,  and 
with  probability  1/3  it  was  right.  Now  the  host  opens  a  door  with  a  goat  (note 
that  he  can  always  do  this).  In  case  your  initial  choice  was  wrong  the  host  has 
only  one  option  to  show  a  door  with  a  goat,  and  switching  leads  you  to  the 
door  with  the  car.  In  case  your  initial  choice  was  right  the  host  has  two  goats 
to  choose  from,  so  switching  will  lead  you  to  a  goat.  We  see  that  switching 
is  the  best  strategy,  doubling  our  chances  to  win.  To  stress  this  argument, 
consider  the  following  generalization  of  the  problem:  suppose  there  are  10  000 
doors,  behind  one  is  a  car  and  behind  the  rest,  goats.  After  you  make  your 
choice,  the  host  will  open  9998  doors  with  goats,  and  offers  you  the  option  to 
switch.  To  change  or  not  to  change,  that’s  the  question!  Still  not  convinced? 
Use  your  Internet  browser  to  find  one  of  the  zillion  sites  where  one  can  run  a 
simulation of the Monty Hall problem (more about simulation in Chapter 6).
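Such a simulation is also easy to write yourself. The following is a minimal sketch of the three-door game as described above, assuming that the host always opens a goat door other than your pick and chooses at random when he has two goats available; the function names and the seed are our own choices.

```python
import random


def play(switch: bool, rng: random.Random) -> bool:
    """Play one game; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that hides a goat and is not the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither the current pick nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch: bool, trials: int = 100_000, seed: int = 1) -> float:
    """Estimate the probability of winning by repeated play."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

The estimated winning fractions should be close to 1/3 for staying and 2/3 for switching, in line with the argument in the text.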

In  fact,  there  are  quite  a  lot  of  variations  on  the  problem.  For  example,  the 
situation  that  there  are  four  doors:  you  select  a  door,  the  host  always  opens  a 
door  with  a  goat,  and  offers  you  to  select  another  door.  After  you  have  made 
up  your  mind  he  opens  a  door  with  a  goat,  and  again  offers  you  to  switch. 
After  you  have  decided,  he  opens  the  door  you  selected.  What  is  now  the  best 
strategy?  In  this  situation  switching  only  at  the  last  possible  moment  yields 
a  probability  of  3/4  to  bring  the  car  home.  Using  the  law  of  total  probability 
from  Section  3.3  you  will  find  that  this  is  indeed  the  best  possible  strategy. 
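The four-door variation can be explored by simulation as well. This is a minimal sketch under our reading of the rules above: the host twice opens a closed goat door other than your current pick, and a contestant who switches picks at random among the other closed doors. A strategy is a pair of choices (switch at the first offer, switch at the second offer); all names are illustrative.

```python
import random


def play_four_doors(switch_first: bool, switch_second: bool,
                    rng: random.Random) -> bool:
    """Play one four-door game; return True if the final pick hides the car."""
    doors = set(range(4))
    car = rng.randrange(4)
    pick = rng.randrange(4)
    opened = set()
    for do_switch in (switch_first, switch_second):
        # Host opens a closed goat door that is not the current pick.
        goat_options = sorted(doors - opened - {pick, car})
        opened.add(rng.choice(goat_options))
        if do_switch:
            # Move to a randomly chosen closed door other than the current pick.
            pick = rng.choice(sorted(doors - opened - {pick}))
    return pick == car


def win_rate(switch_first: bool, switch_second: bool,
             trials: int = 50_000, seed: int = 2) -> float:
    rng = random.Random(seed)
    wins = sum(play_four_doors(switch_first, switch_second, rng)
               for _ in range(trials))
    return wins / trials


rates = {(first, second): win_rate(first, second)
         for first in (False, True) for second in (False, True)}
for strategy, rate in rates.items():
    print(strategy, rate)
```

Under these assumptions the strategy (False, True), that is, staying at the first offer and switching at the last one, should come out on top with a winning fraction near 3/4.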


1.4  The  space  shuttle  Challenger 

On  January  28,  1986,  the  space  shuttle  Challenger  exploded  about  one  minute 
after  it  had  taken  off  from  the  launch  pad  at  Kennedy  Space  Center  in  Florida. 
The  seven  astronauts  on  board  were  killed  and  the  spacecraft  was  destroyed. 
The  cause  of  the  disaster  was  explosion  of  the  main  fuel  tank,  caused  by  flames 
of  hot  gas  erupting  from  one  of  the  so-called  solid  rocket  boosters. 

These solid rocket boosters had been cause for concern since the early years
of the shuttle. They are manufactured in segments, which are joined at a later
stage, resulting in a number of joints that are sealed to protect against leakage.
This is done with so-called O-rings, which in turn are protected by a layer
of putty. When the rocket motor ignites, high pressure and high temperature
build up within. In time these may burn away the putty and subsequently
erode the O-rings, eventually causing hot flames to erupt on the outside. In a
nutshell, this is what actually happened to the Challenger.

After the explosion, an investigative commission determined the causes of the
disaster, and a report was issued with many findings and recommendations
([24]). On the evening of January 27, a decision to launch the next day had
been made, notwithstanding the fact that an extremely low temperature of
31°F had been predicted, well below the operating limit of 40°F set by Morton
Thiokol, the manufacturer of the solid rocket boosters. Apparently, a
“management decision” was made to overrule the engineers’ recommendation not
to launch. The inquiry faulted both NASA and Morton Thiokol management
for giving in to the pressure to launch, ignoring warnings about problems with
the seals.

The Challenger launch was the 24th of the space shuttle program, and we
shall look at the data on the number of failed O-rings, available from previous
launches (see [5] for more details). Each rocket has three O-rings, and two
rocket boosters are used per launch, so in total six O-rings are used each
time. Because low temperatures are known to adversely affect the O-rings,
we also look at the corresponding launch temperature. In Figure 1.3 the dots
show the number of failed O-rings per mission (there are 23 dots: one time the
boosters could not be recovered from the ocean; temperatures are rounded to
the nearest degree Fahrenheit; in case of two or more equal data points these
are shifted slightly). If you ignore the dots representing zero failures, which
all occurred at high temperatures, a temperature effect is not apparent.


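[Figure 1.3: number of failed O-rings per mission plotted against launch
temperature in °F, together with the fitted curve.]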

Source:  based  on  data  from  Volume  VI  of  the  Report  of  the  Presidential 
Commission  on  the  space  shuttle  Challenger  accident,  Washington,  DC,  1986. 

Fig. 1.3. Space shuttle failure data of pre-Challenger missions and fitted model of
expected number of failures per mission function.

In a model to describe these data, the probability p(t) that an individual
O-ring fails should depend on the launch temperature t. Per mission, the
number of failed O-rings follows a so-called binomial distribution: six O-rings,
and each may fail with probability p(t); more about this distribution and the
circumstances under which it arises can be found in Chapter 4. A logistic
model was used in [5] to describe the dependence on t:

$$p(t) = \frac{e^{a + b \cdot t}}{1 + e^{a + b \cdot t}}.$$

A high value of a + b · t corresponds to a high value of p(t), a low value to
low p(t). Values of a and b were determined from the data, according to the
following principle: choose a and b so that the probability that we get data as
in Figure 1.3 is as high as possible. This is an example of the use of the method
of maximum likelihood, which we shall discuss in Chapter 21. This results in
a = 5.085 and b = −0.1156, which indeed leads to lower probabilities at higher
temperatures, and to p(31) = 0.8178. We can also compute the (estimated)
expected number of failures, 6 · p(t), as a function of the launch temperature t;
this is the plotted line in the figure.
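The fitted model is easy to evaluate. The following is a minimal sketch that plugs the estimates quoted in the text (a = 5.085, b = −0.1156) into the logistic formula; it does not perform the maximum likelihood fit itself, and the function names are our own.

```python
from math import exp

# Estimates quoted in the text.
A = 5.085
B = -0.1156


def p_fail(t: float) -> float:
    """Estimated probability that a single O-ring fails at launch temperature t (°F)."""
    z = A + B * t
    return exp(z) / (1 + exp(z))


def expected_failures(t: float, n_rings: int = 6) -> float:
    """Estimated expected number of failed O-rings per mission (mean of a binomial)."""
    return n_rings * p_fail(t)


# 31 is the predicted launch temperature; the other values are for comparison.
for temperature in (31, 50, 65, 80):
    print(temperature, round(p_fail(temperature), 4),
          round(expected_failures(temperature), 2))
```

At t = 31 this reproduces the value p(31) = 0.8178 quoted above, and the probabilities decrease as the temperature rises.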

Combining the estimates with estimated probabilities of other events that
should happen for a complete failure of the field-joint, the estimated
probability of such a failure is 0.023. With six field-joints, the probability of at least
one complete failure is then 1 − (1 − 0.023)^6 = 0.13!
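This last step is a complement calculation: at least one joint fails exactly when it is not true that all six survive. A minimal sketch, assuming (as the formula in the text implicitly does) that the six field-joints fail independently of each other; the variable names are ours.

```python
p_joint_failure = 0.023  # estimated probability that one field-joint fails completely
n_joints = 6

# Probability that all joints survive, using independence between joints.
p_all_survive = (1 - p_joint_failure) ** n_joints

# Complement: at least one complete failure.
p_at_least_one = 1 - p_all_survive

print(round(p_at_least_one, 2))
```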

1.5  Statistics  versus  intelligence  agencies 

During World War II, information about Germany’s war potential was essential
to the Allied forces in order to schedule the time of invasions and to carry
out the Allied strategic bombing program. Methods for estimating German
production used during the early phases of the war proved to be inadequate.
In order to obtain more reliable estimates of German war production, experts
from the Economic Warfare Division of the American Embassy and the
British Ministry of Economic Warfare started to analyze markings and serial
numbers obtained from captured German equipment.

Each piece of enemy equipment was labeled with markings, which included
all or some portion of the following information: (a) the name and location
of the maker; (b) the date of manufacture; (c) a serial number; and (d)
miscellaneous markings such as trademarks, mold numbers, casting numbers,
etc. The purpose of these markings was to maintain an effective check on
production standards and to perform spare parts control. However, these same
markings offered Allied intelligence a wealth of information about German
industry.

The first products to be analyzed were tires taken from German aircraft shot
down over Britain and from supply dumps of aircraft and motor vehicle tires
captured in North Africa. The marking on each tire contained the maker’s name,
a serial number, and a two-letter code for the date of manufacture. The first
step in analyzing the tire markings involved breaking the two-letter date code.
It was conjectured that one letter represented the month and the other the
year of manufacture, and that there should be 12 letter variations for the
month code and 3 to 6 for the year code. This, indeed, turned out to be true.
The following table presents examples of the 12 letter variations used by four
different manufacturers.


            Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
Dunlop       T    I    E    B    R    A    P    O    L    N    U    D
Fulda        F    U    L    D    A    M    U    N    S    T    E    R
Phoenix      F    O    N    I    X    H    A    M    B    U    R    G
Sempirit     A    B    C    D    E    F    G    H    I    J    K    L

Reprinted with permission from “An empirical approach to economic
intelligence” by R. Ruggles and H. Brodie, pp. 72-91, Vol. 42, No. 237. © 1947 by
the American Statistical Association. All rights reserved.


For  instance,  the  Dunlop  code  was  Dunlop  Arbeit  spelled  backwards.  Next, 
the  year  code  was  broken  and  the  numbering  system  was  solved  so  that  for 
each  manufacturer  individually  the  serial  numbers  could  be  dated.  Moreover, 
for  each  month,  the  serial  numbers  could  be  recoded  to  numbers  running 
from  1  to  some  unknown  largest  number  N,  and  the  observed  (recoded)  serial 
numbers  could  be  seen  as  a  subset  of  this.  The  objective  was  to  estimate  N 
for  each  month  and  each  manufacturer  separately  by  means  of  the  observed 
(recoded)  serial  numbers.  In  Chapter  20  we  discuss  two  different  methods 
of  estimation,  and  we  show  that  the  method  based  on  only  the  maximum 
observed  (recoded)  serial  number  is  much  better  than  the  method  based  on 
the  average  observed  (recoded)  serial  numbers. 
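A small simulation makes the difference between the two approaches visible. This is a minimal sketch, not the analysis of Chapter 20: we assume the observed serial numbers are a random sample without replacement from 1, ..., N, and we use two standard unbiased estimators for this setting, 2 · (sample mean) − 1 and (n + 1)/n · (sample maximum) − 1. The value N = 1000, the sample size n = 10, and all function names are hypothetical choices for illustration.

```python
import random


def estimate_from_mean(sample):
    """Estimate N from the average: the mean of 1..N is (N + 1) / 2."""
    return 2 * sum(sample) / len(sample) - 1


def estimate_from_max(sample):
    """Estimate N from the largest observed number, corrected upward for the gap."""
    n = len(sample)
    return (n + 1) / n * max(sample) - 1


def mean_squared_errors(N=1000, n=10, runs=2000, seed=0):
    """Average squared estimation error of both methods over many simulated samples."""
    rng = random.Random(seed)
    se_mean = se_max = 0.0
    for _ in range(runs):
        sample = rng.sample(range(1, N + 1), n)  # serial numbers without replacement
        se_mean += (estimate_from_mean(sample) - N) ** 2
        se_max += (estimate_from_max(sample) - N) ** 2
    return se_mean / runs, se_max / runs


mse_mean, mse_max = mean_squared_errors()
print("mean-based:", round(mse_mean), " max-based:", round(mse_max))
```

In runs of this kind the maximum-based estimate has a clearly smaller mean squared error, which is consistent with the claim that the method based on the maximum is the better one.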

With  a  sample  of  about  1400  tires  from  five  producers,  individual  monthly 
output  figures  were  obtained  for  almost  all  months  over  a  period  from  1939 
to  mid-1943.  The  following  table  compares  the  accuracy  of  estimates  of  the 
average  monthly  production  of  all  manufacturers  of  the  first  quarter  of  1943 
with  the  statistics  of  the  Speer  Ministry  that  became  available  after  the  war. 
The  accuracy  of  the  estimates  can  be  appreciated  even  more  if  we  compare 
them  with  the  figures  obtained  by  Allied  intelligence  agencies.  They  estimated, 
using  other  methods,  the  production  between  900  000  and  1 200  000  per  month! 


Type of tire               Estimated production    Actual production
Truck and passenger car         147 000                 159 000
Aircraft                         28 500                  26 400
Total                           175 500                 186 100

Reprinted with permission from “An empirical approach to economic
intelligence” by R. Ruggles and H. Brodie, pp. 72-91, Vol. 42, No. 237. © 1947 by
the American Statistical Association. All rights reserved.


1.6  The speed of light

In 1983 the definition of the meter (the SI unit of length) was changed to:
The meter is the length of the path traveled by light in vacuum during a time
interval of 1/299 792 458 of a second. This implicitly defines the speed of light
as 299 792 458 meters per second. It was done because one thought that the
speed of light was so accurately known that it made more sense to define the
meter in terms of the speed of light rather than vice versa, a remarkable end
to a long story of scientific discovery. For a long time most scientists believed
that the speed of light was infinite. Early experiments devised to demonstrate
the finiteness of the speed of light failed because the speed is so
extraordinarily high. In the 18th century this debate was settled, and work started on
determination of the speed, using astronomical observations, but a century
later scientists turned to earth-based experiments. Albert Michelson refined
experimental arrangements from two previous experiments and conducted a
series of measurements in June and early July of 1879, at the U.S. Naval
Academy in Annapolis. In this section we give a very short summary of his
work. It is extracted from an article in Statistical Science ([18]).

The principle of speed measurement is easy, of course: measure a distance and
the time it takes to travel that distance; the speed equals distance divided by
time. For an accurate determination, both the distance and the time need
to be measured accurately, and with the speed of light this is a problem:
either we use a very large distance, and then the accuracy of the distance
measurement is a problem, or we have a very short time interval, which is also
very difficult to measure accurately.

In  Michelson’s  time  it  was  known  that  the  speed  of  light  was  about  300  000 
km/s,  and  he  embarked  on  his  study  with  the  goal  of  an  improved  value  of  the 
speed  of  light.  His  experimental  setup  is  depicted  schematically  in  Figure  1.4. 
Light emitted from a light source is aimed, through a slit in a fixed plate,
at a rotating mirror; we call its distance from the plate the radius. At one
particular angle, this rotating mirror reflects the beam in the direction of a
distant (fixed) flat mirror. On its way the light first passes through a focusing
lens. This second mirror is positioned in such a way that it reflects the beam
back in the direction of the rotating mirror. In the time it takes the light to
travel back and forth between the two mirrors, the rotating mirror has moved
by an angle $\alpha$, resulting in a reflection on the plate that is displaced with
respect to the source beam that passed through the slit. The radius and the
displacement determine the angle $\alpha$ because

$$\tan 2\alpha = \frac{\text{displacement}}{\text{radius}},$$

and combined with the number of revolutions per second (rps) of the mirror,
this determines the elapsed time:

$$\text{time} = \frac{\alpha / 2\pi}{\text{rps}}.$$

During  this  time  the  light  traveled  twice  the  distance  between  the  mirrors,  so 
the  speed  of  light  in  air  now  follows: 

$$c_{\text{air}} = \frac{2 \cdot \text{distance}}{\text{time}}.$$
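To make the chain of formulas concrete, here is a minimal Python sketch that turns the four measured quantities into a speed. The function name and all numbers are our own hypothetical choices for a consistency check (they are not Michelson's data): we start from an assumed speed, compute the displacement it would produce, and verify that the formulas recover the assumed speed.

```python
import math

def speed_from_measurements(distance, radius, displacement, rps):
    """Speed of light in air from the four measured quantities."""
    # tan(2 * alpha) = displacement / radius
    alpha = 0.5 * math.atan(displacement / radius)
    # the mirror turns alpha / (2 * pi) of a revolution at rps revolutions per second
    elapsed = (alpha / (2 * math.pi)) / rps
    # the light travels to the fixed mirror and back
    return 2 * distance / elapsed

# Hypothetical round numbers, chosen only to check the formulas against each other.
c_assumed = 3.0e8      # meters per second
distance = 600.0       # meters between the two mirrors
radius = 10.0          # meters from plate to rotating mirror
rps = 250.0            # revolutions per second

elapsed = 2 * distance / c_assumed            # round-trip time of the light
alpha = 2 * math.pi * rps * elapsed           # angle turned by the mirror
displacement = radius * math.tan(2 * alpha)   # resulting displacement on the plate

c_recovered = speed_from_measurements(distance, radius, displacement, rps)
print(c_recovered)
```

Changing any one input slightly and rerunning shows how an error in a single quantity carries over into the computed speed.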

All  in  all,  it  looks  simple:  just  measure  the  four  quantities — distance,  radius, 
displacement  and  the  revolutions  per  second — and  do  the  calculations.  This 
is  much  harder  than  it  looks,  and  problems  in  the  form  of  inaccuracies  are 
lurking  everywhere.  An  error  in  any  of  these  quantities  translates  directly  into 
some  error  in  the  final  result. 

Michelson did his utmost to reduce errors. For example, the distance between
the  mirrors  was  about  2000  feet,  and  to  measure  it  he  used  a  steel  measuring 
tape.  Its  nominal  length  was  100  feet,  but  he  carefully  checked  this  using  a 
copy  of  the  official  “standard  yard.”  He  found  that  the  tape  was  in  fact  100.006 
feet.  This  way  he  eliminated  a  (small)  systematic  error. 

Now  imagine  using  the  tape  to  measure  a  distance  of  2000  feet:  you  have  to  use 
the  tape  20  times,  each  time  marking  the  next  100  feet.  Do  it  again,  and  you 
probably  find  a  slightly  different  answer,  no  matter  how  hard  you  try  to  be 
very  precise  in  every  step  of  the  measuring  procedure.  This  kind  of  variation 
is  inevitable:  sometimes  we  end  up  with  a  value  that  is  a  bit  too  high,  other 
times  it  is  too  low,  but  on  average  we’re  doing  okay — assuming  that  we  have 
eliminated  sources  of  systematic  error,  as  in  the  measuring  tape.  Michelson 
measured  the  distance  five  times,  which  resulted  in  values  between  1984.93 
and  1985.17  feet  (after  correcting  for  the  temperature-dependent  stretch),  and 
he  used  the  average  as  the  “true  distance.” 
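The effect of averaging can be illustrated with a small simulation. This is only a sketch under assumptions of our own: the true distance, the size of the random error per tape placement, and the normal error model are hypothetical and are not taken from Michelson's report.

```python
import random
import statistics

rng = random.Random(1879)   # fixed seed so the sketch is reproducible
TRUE_DISTANCE = 2000.0      # feet; hypothetical
TAPE_ERROR_SD = 0.01        # feet per tape placement; hypothetical

def measure_once():
    """One measurement: 20 tape placements, each with a small random error."""
    return TRUE_DISTANCE + sum(rng.gauss(0.0, TAPE_ERROR_SD) for _ in range(20))

def measure_averaged(repeats=5):
    """Average of several independent measurements of the same distance."""
    return statistics.mean(measure_once() for _ in range(repeats))

single = [measure_once() for _ in range(2000)]
averaged = [measure_averaged() for _ in range(2000)]

spread_single = statistics.stdev(single)
spread_averaged = statistics.stdev(averaged)
print(spread_single, spread_averaged)
```

In this model the averaged values scatter less around the true distance than single measurements do. Averaging does nothing against a systematic error such as a tape that is slightly too long.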

In many phases of the measuring process Michelson attempted to identify
and determine systematic errors and subsequently applied corrections. He
also systematically repeated measuring steps and averaged the results to
reduce variability. His final dataset consists of 100 separate measurements (see
Table 17.1), but each is in fact summarized and averaged from repeated
measurements on several variables. The final result he reported was that the speed
of light in vacuum (this involved a conversion) was 299 944 ± 51 km/s, where
the 51 is an indication of the uncertainty in the answer. In retrospect, we must
conclude that, in spite of Michelson’s admirable meticulousness, some source
of error must have escaped his attention, as his result is off by about 150 km/s.
With current methods we would derive from his data a so-called 95% confidence
interval: 299 944 ± 15.5 km/s, suggesting that Michelson’s uncertainty
analysis was a little conservative. The methods used to construct confidence
intervals are the topic of Chapters 23 and 24.


2 


Outcomes,  events,  and  probability 


The world around us is full of phenomena we perceive as random or
unpredictable. We aim to model these phenomena as outcomes of some experiment,
where you should think of experiment in a very general sense. The outcomes
are elements of a sample space $\Omega$, and subsets of $\Omega$ are called events. The events
will be assigned a probability, a number between 0 and 1 that expresses how
likely the event is to occur.


2.1  Sample  spaces 

Sample  spaces  are  simply  sets  whose  elements  describe  the  outcomes  of  the 
experiment  in  which  we  are  interested. 

We  start  with  the  most  basic  experiment:  the  tossing  of  a  coin.  Assuming  that 
we  will  never  see  the  coin  land  on  its  rim,  there  are  two  possible  outcomes: 
heads and tails. We therefore take as the sample space associated with this
experiment the set $\Omega = \{H, T\}$.

In  another  experiment  we  ask  the  next  person  we  meet  on  the  street  in  which 
month  her  birthday  falls.  An  obvious  choice  for  the  sample  space  is 

$$\Omega = \{\text{Jan}, \text{Feb}, \text{Mar}, \text{Apr}, \text{May}, \text{Jun}, \text{Jul}, \text{Aug}, \text{Sep}, \text{Oct}, \text{Nov}, \text{Dec}\}.$$

In  a  third  experiment  we  load  a  scale  model  for  a  bridge  up  to  the  point 
where  the  structure  collapses.  The  outcome  is  the  load  at  which  this  occurs. 
In  reality,  one  can  only  measure  with  finite  accuracy,  e.g.,  to  five  decimals,  and 
a  sample  space  with  just  those  numbers  would  strictly  be  adequate.  However, 
in principle, the load itself could be any positive number and therefore
$\Omega = (0, \infty)$ is the right choice. Even though in reality there may also be an upper
limit  to  what  loads  are  conceivable,  it  is  not  necessary  or  practical  to  try  to 
limit  the  outcomes  correspondingly. 

In  a  fourth  experiment,  we  find  on  our  doormat  three  envelopes,  sent  to  us  by 
three  different  persons,  and  we  look  in  which  order  the  envelopes  lie  on  top  of 
each  other.  Coding  them  1,  2,  and  3,  the  sample  space  would  be 

$$\Omega = \{123, 132, 213, 231, 312, 321\}.$$

Quick  exercise  2.1  If  we  received  mail  from  four  different  persons,  how 
many  elements  would  the  corresponding  sample  space  have? 

In general one might consider the order in which n different objects can be
placed. This is called a permutation of the n objects. As we have seen, there
are 6 possible permutations of 3 objects, and $4 \cdot 6 = 24$ of 4 objects. What
happens is that if we add the nth object, then this can be placed in any of n
positions in any of the permutations of $n - 1$ objects. Therefore there are

$$n \cdot (n - 1) \cdots 3 \cdot 2 \cdot 1 = n!$$

possible permutations of n objects. Here $n!$ is the standard notation for this
product and is pronounced “n factorial.” It is convenient to define $0! = 1$.
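The counting argument can be checked by brute force. The sketch below lists all orderings of n labeled objects with Python's standard `itertools.permutations` and compares their number with n!. The helper name `orderings` is our own.

```python
from itertools import permutations
from math import factorial

def orderings(n):
    """All orderings of the labels 1, ..., n, written as strings (n at most 9)."""
    labels = "".join(str(k) for k in range(1, n + 1))
    return ["".join(p) for p in permutations(labels)]

three = orderings(3)
print(three)  # ['123', '132', '213', '231', '312', '321']

# The number of orderings equals n!, including the convention 0! = 1.
counts = {n: len(orderings(n)) for n in range(0, 7)}
print(counts)
```

For n = 0 there is exactly one (empty) ordering, which matches the convention 0! = 1.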


2.2  Events 

Subsets of the sample space are called events. We say that an event A occurs
if  the  outcome  of  the  experiment  is  an  element  of  the  set  A.  For  example,  in 
the  birthday  experiment  we  can  ask  for  the  outcomes  that  correspond  to  a 
long  month,  i.e.,  a  month  with  31  days.  This  is  the  event 

L  =  {Jan,  Mar,  May,  Jul,  Aug,  Oct,  Dec}. 

Events  may  be  combined  according  to  the  usual  set  operations. 

For  example  if  R  is  the  event  that  corresponds  to  the  months  that  have  the 
letter  r  in  their  (full)  name  (so  R  =  {Jan,  Feb,  Mar,  Apr,  Sep,  Oct,  Nov,  Dec}), 
then  the  long  months  that  contain  the  letter  r  are 

$$L \cap R = \{\text{Jan}, \text{Mar}, \text{Oct}, \text{Dec}\}.$$

The set $L \cap R$ is called the intersection of L and R and occurs if both L and R
occur. Similarly, we have the union $A \cup B$ of two sets A and B, which occurs if
at least one of the events A and B occurs. Another common operation is taking
complements. The event $A^c = \{\omega \in \Omega : \omega \notin A\}$ is called the complement of A;
it occurs if and only if A does not occur. The complement of $\Omega$ is denoted
$\emptyset$, the empty set, which represents the impossible event. Figure 2.1 illustrates
these three set operations.
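Events map directly onto Python's built-in sets, which may help to experiment with these operations. The following sketch encodes the birthday experiment; the variable names are our own.

```python
omega = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}          # long months
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}   # months with an r

intersection = L & R        # both L and R occur
union = L | R               # at least one of L and R occurs
complement_L = omega - L    # L does not occur

print(sorted(intersection))
print(sorted(omega - union))
print(omega - omega == set())   # the complement of the sample space is empty
```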


Fig. 2.1. Diagrams of intersection, union, and complement.


We call events A and B disjoint or mutually exclusive if A and B have no
outcomes in common; in set terminology: $A \cap B = \emptyset$. For example, the event L
“the birthday falls in a long month” and the event {Feb} are disjoint.

Finally, we say that event A implies event B if the outcomes of A also lie
in B. In set notation: $A \subset B$; see Figure 2.2.

Some  people  like  to  use  double  negations: 

“It  is  certainly  not  true  that  neither  John  nor  Mary  is  to  blame.” 

This  is  equivalent  to:  “John  or  Mary  is  to  blame,  or  both.”  The  following 
useful  rules  formalize  this  mental  operation  to  a  manipulation  with  events. 


DeMorgan’s laws. For any two events A and B we have
$$(A \cup B)^c = A^c \cap B^c \quad \text{and} \quad (A \cap B)^c = A^c \cup B^c.$$
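A check on a small example is no proof, but it can build confidence in these laws. The sketch below verifies both identities for every pair of events in a four-element sample space. The sample space and the helper names are our own choices.

```python
from itertools import combinations

omega = frozenset(range(4))   # a small, hypothetical sample space

def all_events(space):
    """Every subset of the sample space, i.e., every event."""
    items = sorted(space)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def complement(event):
    return omega - event

events = all_events(omega)
de_morgan_holds = all(
    complement(A | B) == complement(A) & complement(B)
    and complement(A & B) == complement(A) | complement(B)
    for A in events
    for B in events
)
print(len(events), de_morgan_holds)
```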


Quick exercise 2.2 Let J be the event “John is to blame” and M the event
“Mary is to blame.” Express the two statements above in terms of the events
J, $J^c$, M, and $M^c$, and check the equivalence of the statements by means of
DeMorgan’s laws.


Fig.  2.2.  Minimal  and  maximal  intersection  of  two  sets. 




2.3  Probability 

We  want  to  express  how  likely  it  is  that  an  event  occurs.  To  do  this  we  will 
assign  a  probability  to  each  event.  The  assignment  of  probabilities  to  events  is 
in  general  not  an  easy  task,  and  some  of  the  coming  chapters  will  be  dedicated 
directly  or  indirectly  to  this  problem.  Since  each  event  has  to  be  assigned  a 
probability,  we  speak  of  a  probability  function.  It  has  to  satisfy  two  basic 
properties. 


Definition. A probability function P on a finite sample space $\Omega$
assigns to each event A in $\Omega$ a number P(A) in [0,1] such that

(i) $P(\Omega) = 1$, and

(ii) $P(A \cup B) = P(A) + P(B)$ if A and B are disjoint.

The  number  P(A)  is  called  the  probability  that  A  occurs. 


Property  (i)  expresses  that  the  outcome  of  the  experiment  is  always  an  element 
of  the  sample  space,  and  property  (ii)  is  the  additivity  property  of  a  probability 
function.  It  implies  additivity  of  the  probability  function  over  more  than  two 
sets; e.g., if A, B, and C are disjoint events, then the two events $A \cup B$ and
C are also disjoint, so

$$P(A \cup B \cup C) = P(A \cup B) + P(C) = P(A) + P(B) + P(C).$$

We  will  now  look  at  some  examples.  When  we  want  to  decide  whether  Peter 
or  Paul  has  to  wash  the  dishes,  we  might  toss  a  coin.  The  fact  that  we  consider 
this  a  fair  way  to  decide  translates  into  the  opinion  that  heads  and  tails  are 
equally  likely  to  occur  as  the  outcome  of  the  coin-tossing  experiment.  So  we 
put 

$$P(\{H\}) = P(\{T\}) = \frac{1}{2}.$$

Formally we have to write {H} for the set consisting of the single element H,
because  a  probability  function  is  defined  on  events,  not  on  outcomes.  From 
now  on  we  shall  drop  these  brackets. 

Now  it  might  happen,  for  example  due  to  an  asymmetric  distribution  of  the 
mass  over  the  coin,  that  the  coin  is  not  completely  fair.  For  example,  it  might 
be  the  case  that 

$$P(H) = 0.4999 \quad \text{and} \quad P(T) = 0.5001.$$

More generally we can consider experiments with two possible outcomes, say
“failure” and “success”, which have probabilities $1 - p$ and p to occur, where
p is a number between 0 and 1. For example, when our experiment consists
of buying a ticket in a lottery with 10 000 tickets and only one prize, where
“success” stands for winning the prize, then $p = 10^{-4}$.

How  should  we  assign  probabilities  in  the  second  experiment,  where  we  ask 
for  the  month  in  which  the  next  person  we  meet  has  his  or  her  birthday?  In 
analogy  with  what  we  have  just  done,  we  put 
$$P(\text{Jan}) = P(\text{Feb}) = \cdots = P(\text{Dec}) = \frac{1}{12}.$$

Some of you might object to this and propose that we put, for example,

$$P(\text{Jan}) = \frac{31}{365} \quad \text{and} \quad P(\text{Apr}) = \frac{30}{365},$$

because  we  have  long  months  and  short  months.  But  then  the  very  precise 
among  us  might  remark  that  this  does  not  yet  take  care  of  leap  years. 

Quick  exercise  2.3  If  you  would  take  care  of  the  leap  years,  assuming  that 
one  in  every  four  years  is  a  leap  year  (which  again  is  an  approximation  to 
reality!),  how  would  you  assign  a  probability  to  each  month? 

In  the  third  experiment  (the  buckling  load  of  a  bridge),  where  the  outcomes  are 
real  numbers,  it  is  impossible  to  assign  a  positive  probability  to  each  outcome 
(there  are  just  too  many  outcomes!).  We  shall  come  back  to  this  problem  in 
Chapter 5, restricting ourselves in this chapter to finite and countably infinite¹
sample  spaces. 

In  the  fourth  experiment  it  makes  sense  to  assign  equal  probabilities  to  all  six 
outcomes: 

$$P(123) = P(132) = P(213) = P(231) = P(312) = P(321) = \frac{1}{6}.$$

Until  now  we  have  only  assigned  probabilities  to  the  individual  outcomes  of  the 
experiments.  To  assign  probabilities  to  events  we  use  the  additivity  property. 
For instance, to find the probability P(T) of the event T that in the three
envelopes experiment envelope 2 is on top we note that

$$P(T) = P(213) + P(231) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}.$$

In  general,  additivity  of  P  implies  that  the  probability  of  an  event  is  obtained 
by  summing  the  probabilities  of  the  outcomes  belonging  to  the  event. 
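This summation rule is easy to express in code. The following sketch stores the probability of each outcome of the three-envelope experiment in a dictionary and obtains the probability of an event by summing over its outcomes. The names `prob` and `P` are ours, and we read the first digit of an outcome as the envelope on top, in line with the outcomes 213 and 231 used above.

```python
from fractions import Fraction
from itertools import permutations

# Sample space of the three-envelope experiment, all outcomes equally likely.
outcomes = ["".join(p) for p in permutations("123")]
prob = {w: Fraction(1, len(outcomes)) for w in outcomes}

def P(event):
    """Probability of an event: the sum of the probabilities of its outcomes."""
    return sum((prob[w] for w in event), Fraction(0))

T = {w for w in outcomes if w[0] == "2"}   # envelope 2 is on top
print(sorted(T), P(T))
print(P(set(outcomes)))                    # property (i): the sample space has probability 1
```

Using `Fraction` instead of floating-point numbers keeps the results exact, so that 1/6 + 1/6 is reported as 1/3.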

Quick exercise 2.4 Compute P(L) and P(R) in the birthday experiment.

Finally  we  mention  a  rule  that  permits  us  to  compute  probabilities  of  events 
A and B that are not disjoint. Note that we can write $A = (A \cap B) \cup (A \cap B^c)$,
which is a disjoint union; hence

$$P(A) = P(A \cap B) + P(A \cap B^c).$$

If we split $A \cup B$ in the same way with B and $B^c$, we obtain the events
$(A \cup B) \cap B$, which is simply B, and $(A \cup B) \cap B^c$, which is nothing but $A \cap B^c$.

¹ This means: although infinite, we can still count them one by one;
$\Omega = \{\omega_1, \omega_2, \ldots\}$. The interval [0,1] of real numbers is an example of an uncountable
sample space.




Thus

$$P(A \cup B) = P(B) + P(A \cap B^c).$$

Eliminating $P(A \cap B^c)$ from these two equations we obtain the following rule.


The  probability  of  a  union.  For  any  two  events  A  and  B  we 
have 


$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$


From  the  additivity  property  we  can  also  find  a  way  to  compute  probabilities 
of complements of events: from $A \cup A^c = \Omega$, we deduce that

$$P(A^c) = 1 - P(A).$$
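Both rules can be checked on the birthday experiment. The sketch below assumes the simple model in which every month has probability 1/12. The variable and function names are our own.

```python
from fractions import Fraction

omega = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}
L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}          # long months
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}   # months with an r

def P(event):
    """Equally likely months: each outcome has probability 1/12."""
    return Fraction(len(event), len(omega))

# Probability of a union: direct count versus the rule.
direct = P(L | R)
by_rule = P(L) + P(R) - P(L & R)
print(direct, by_rule)

# Probability of a complement.
print(P(omega - L), 1 - P(L))
```

Adding P(L) and P(R) alone would give 15/12, which exceeds 1. Subtracting the probability of the intersection corrects for the four months that were counted twice.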


2.4  Products  of  sample  spaces 

Basic  to  statistics  is  that  one  usually  does  not  consider  one  experiment,  but 
that  the  same  experiment  is  performed  several  times.  For  example,  suppose 
we  throw  a  coin  two  times.  What  is  the  sample  space  associated  with  this  new 
experiment?  It  is  clear  that  it  should  be  the  set 

$$\Omega = \{H, T\} \times \{H, T\} = \{(H, H), (H, T), (T, H), (T, T)\}.$$

If in the original experiment we had a fair coin, i.e., P(H) = P(T), then in
this new experiment all 4 outcomes again have equal probabilities:

$$P((H, H)) = P((H, T)) = P((T, H)) = P((T, T)) = \frac{1}{4}.$$

Somewhat more generally, if we consider two experiments with sample spaces
$\Omega_1$ and $\Omega_2$ then the combined experiment has as its sample space the set

$$\Omega = \Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2) : \omega_1 \in \Omega_1, \omega_2 \in \Omega_2\}.$$

If $\Omega_1$ has r elements and $\Omega_2$ has s elements, then $\Omega_1 \times \Omega_2$ has rs elements.
Now suppose that in the first, the second, and the combined experiment all
outcomes are equally likely to occur. Then the outcomes in the first
experiment have probability 1/r to occur, those of the second experiment 1/s, and
those of the combined experiment probability 1/(rs). Motivated by the fact that
$\frac{1}{rs} = \frac{1}{r} \times \frac{1}{s}$, we will assign probability $p_i p_j$ to the outcome $(\omega_i, \omega_j)$
in the combined experiment, in the case that $\omega_i$ has probability $p_i$ and $\omega_j$ has
probability $p_j$ to occur. One should realize that this is by no means the only
way  to  assign  probabilities  to  the  outcomes  of  a  combined  experiment.  The 
preceding  choice  corresponds  to  the  situation  where  the  two  experiments  do 
not  influence  each  other  in  any  way.  What  we  mean  by  this  influence  will  be 
explained  in  more  detail  in  the  next  chapter. 
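The product construction is short enough to write out in code. In the sketch below, each experiment is a dictionary from outcomes to probabilities, and the combined experiment assigns the product of the two probabilities to each pair. The function name `product_space` and the two example experiments (a fair coin and the lottery mentioned in Section 2.3) are our own choices for illustration.

```python
from fractions import Fraction
from itertools import product

coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
lottery = {"success": Fraction(1, 10000), "failure": Fraction(9999, 10000)}

def product_space(first, second):
    """Combined experiment in which the two experiments do not influence each other."""
    return {(w1, w2): first[w1] * second[w2]
            for w1, w2 in product(first, second)}

two_coins = product_space(coin, coin)
mixed = product_space(coin, lottery)

print(two_coins)
print(len(mixed), sum(mixed.values()))
```

With r outcomes in the first experiment and s in the second, the dictionary has rs entries, and the assigned probabilities again add up to 1.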


Quick exercise 2.5 Consider the sample space $\{a_1, a_2, a_3, a_4, a_5, a_6\}$ of some
experiment, where outcome $a_i$ has probability $p_i$ for $i = 1, \ldots, 6$. We perform
this experiment twice in such a way that the associated probabilities are

$$P((a_i, a_i)) = p_i, \quad \text{and} \quad P((a_i, a_j)) = 0 \text{ if } i \neq j, \quad \text{for } i, j = 1, \ldots, 6.$$

Check that P is a probability function on the sample space
$\Omega = \{a_1, \ldots, a_6\} \times \{a_1, \ldots, a_6\}$ of the combined experiment. What is the relationship between
the first experiment and the second experiment that is determined by this
probability function?

We  started  this  section  with  the  experiment  of  throwing  a  coin  twice.  If  we 
want to learn more about the randomness associated with a particular
experiment, then we should repeat it more often, say n times. For example, if we
perform  an  experiment  with  outcomes  1  (success)  and  0  (failure)  five  times, 
and  we  consider  the  event  A  “exactly  one  experiment  was  a  success,”  then 
this  event  is  given  by  the  set 


$$A = \{(0,0,0,0,1),\ (0,0,0,1,0),\ (0,0,1,0,0),\ (0,1,0,0,0),\ (1,0,0,0,0)\}$$

in $\Omega = \{0,1\} \times \{0,1\} \times \{0,1\} \times \{0,1\} \times \{0,1\}$. Moreover, if success has
probability p and failure probability $1 - p$, then

$$P(A) = 5 \cdot (1 - p)^4 \cdot p,$$

since there are five outcomes in the event A, each having probability $(1 - p)^4 \cdot p$.
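This count can be verified by enumerating all 32 outcomes. The sketch below does so for one hypothetical value of p; the function names and the chosen value p = 3/10 are our own assumptions.

```python
from fractions import Fraction
from itertools import product

def outcome_probability(outcome, p):
    """Probability of one sequence of 0s and 1s: multiply p or 1 - p per trial."""
    prob = Fraction(1)
    for result in outcome:
        prob *= p if result == 1 else 1 - p
    return prob

def prob_exactly(k, p, n=5):
    """Sum the probabilities of all outcomes in {0,1}^n with exactly k successes."""
    return sum(
        outcome_probability(outcome, p)
        for outcome in product((0, 1), repeat=n)
        if sum(outcome) == k
    )

p = Fraction(3, 10)   # hypothetical success probability
print(prob_exactly(1, p), 5 * (1 - p) ** 4 * p)
```

The same function can be used to check an answer for other numbers of successes.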

Quick  exercise  2.6  What  is  the  probability  of  the  event  B  “exactly  two 
experiments  were  successful”? 

In  general,  when  we  perform  an  experiment  n  times,  then  the  corresponding 
sample  space  is 

$$\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n,$$

where $\Omega_i$ for $i = 1, \ldots, n$ is a copy of the sample space of the original
experiment. Moreover, we assign probabilities to the outcomes $(\omega_1, \ldots, \omega_n)$ in the
standard way described earlier, i.e.,

$$P((\omega_1, \omega_2, \ldots, \omega_n)) = p_1 \cdot p_2 \cdots p_n,$$

if each $\omega_i$ has probability $p_i$.


2.5  An  infinite  sample  space 

We  end  this  chapter  with  an  example  of  an  experiment  with  infinitely  many 
outcomes.  We  toss  a  coin  repeatedly  until  the  first  head  turns  up.  The  outcome 
of the experiment is the number of tosses it takes to have this first occurrence
of a head. Our sample space is the space of all positive natural numbers

$$\Omega = \{1, 2, 3, \ldots\}.$$

What  is  the  probability  function  P  for  this  experiment? 

Suppose the coin has probability p of falling on heads and probability $1 - p$ of
falling on tails, where $0 < p < 1$. We determine the probability P(n) for each n.
Clearly P(1) = p, the probability that we have a head right away. The event
{2} corresponds to the outcome (T, H) in $\{H, T\} \times \{H, T\}$, so we should have

$$P(2) = (1 - p)p.$$

Similarly, the event {n} corresponds to the outcome (T, T, ..., T, T, H) in the
space $\{H, T\} \times \cdots \times \{H, T\}$. Hence we should have, in general,

$$P(n) = (1 - p)^{n-1} p, \quad n = 1, 2, 3, \ldots.$$

Does this define a probability function on $\Omega = \{1, 2, 3, \ldots\}$? Then we should
at least have $P(\Omega) = 1$. It is not directly clear how to calculate $P(\Omega)$: since
the sample space is no longer finite we have to amend the definition of a
probability function.


Definition. A probability function on an infinite (or finite) sample
space $\Omega$ assigns to each event A in $\Omega$ a number P(A) in [0, 1] such
that

(i) $P(\Omega) = 1$, and

(ii) $P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots$
if $A_1, A_2, A_3, \ldots$ are disjoint events.


Note that this new additivity property is an extension of the previous one
because if we choose $A_3 = A_4 = \cdots = \emptyset$, then

$$\begin{aligned}
P(A_1 \cup A_2) &= P(A_1 \cup A_2 \cup \emptyset \cup \emptyset \cup \cdots) \\
&= P(A_1) + P(A_2) + 0 + 0 + \cdots = P(A_1) + P(A_2).
\end{aligned}$$


Now we can compute the probability of $\Omega$:

$$\begin{aligned}
P(\Omega) &= P(1) + P(2) + \cdots + P(n) + \cdots \\
&= p + (1 - p)p + \cdots + (1 - p)^{n-1} p + \cdots \\
&= p\,[1 + (1 - p) + \cdots + (1 - p)^{n-1} + \cdots].
\end{aligned}$$

The sum $1 + (1 - p) + \cdots + (1 - p)^{n-1} + \cdots$ is an example of a geometric
series. It is well known that when $|1 - p| < 1$,

$$1 + (1 - p) + \cdots + (1 - p)^{n-1} + \cdots = \frac{1}{1 - (1 - p)} = \frac{1}{p}.$$

Therefore we do indeed have $P(\Omega) = p \cdot \frac{1}{p} = 1$.
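The convergence can also be observed numerically. The sketch below computes partial sums P(1) + ... + P(N) exactly and compares them with 1 - (1 - p)^N, the probability that a head turns up within the first N tosses. The value p = 1/3 and the function names are our own hypothetical choices.

```python
from fractions import Fraction

def P(n, p):
    """Probability that the first head appears at toss n."""
    return (1 - p) ** (n - 1) * p

def partial_sum(N, p):
    """P(1) + P(2) + ... + P(N)."""
    return sum(P(n, p) for n in range(1, N + 1))

p = Fraction(1, 3)   # hypothetical probability of heads
for N in (1, 2, 5, 10, 50):
    print(N, float(partial_sum(N, p)))
```

The missing mass after N terms is (1 - p)^N, which shrinks to 0 as N grows, so the partial sums approach 1.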
Quick exercise 2.7 Suppose an experiment in a laboratory is repeated every
day  of  the  week  until  it  is  successful,  the  probability  of  success  being  p.  The 
first  experiment  is  started  on  a  Monday.  What  is  the  probability  that  the 
series  ends  on  the  next  Sunday? 


2.6  Solutions  to  the  quick  exercises 

2.1 The sample space is $\Omega = \{1234, 1243, 1324, 1342, \ldots, 4321\}$. The best way
to count its elements is by noting that for each of the 6 outcomes of the
three-envelope experiment we can put a fourth envelope in any of 4 positions. Hence
$\Omega$ has $4 \cdot 6 = 24$ elements.

2.2 The statement “It is certainly not true that neither John nor Mary is to
blame” corresponds to the event $(J^c \cap M^c)^c$. The statement “John or Mary is
to blame, or both” corresponds to the event $J \cup M$. Equivalence now follows
from DeMorgan’s laws.

2.3 In four years we have $365 \times 3 + 366 = 1461$ days. Hence long months each
have a probability $4 \times 31/1461 = 124/1461$, and short months a probability
120/1461  to  occur.  Moreover,  {Feb}  has  probability  113/1461. 
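These fractions can be verified with exact arithmetic. The sketch below counts the days of each month over a four-year cycle with one leap year, as assumed in the exercise. The dictionary and function names are our own.

```python
from fractions import Fraction

days = {"Jan": 31, "Feb": 28, "Mar": 31, "Apr": 30, "May": 31, "Jun": 30,
        "Jul": 31, "Aug": 31, "Sep": 30, "Oct": 31, "Nov": 30, "Dec": 31}

total_days = 3 * 365 + 366   # four years, one of which is a leap year

def month_probability(month):
    """Share of the days in the four-year cycle that fall in the given month."""
    count = 4 * days[month] + (1 if month == "Feb" else 0)
    return Fraction(count, total_days)

for month in ("Jan", "Apr", "Feb"):
    print(month, month_probability(month))

print(sum(month_probability(m) for m in days))   # the twelve probabilities add up to 1
```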

2.4  Since  there  are  7  long  months  and  8  months  with  an  “r”  in  their  name, 
we have P(L) = 7/12 and P(R) = 8/12.

2.5 Checking that P is a probability function amounts to verifying that
$0 \le P((a_i, a_j)) \le 1$ for all i and j and noting that

$$P(\Omega) = \sum_{i,j=1}^{6} P((a_i, a_j)) = \sum_{i=1}^{6} P((a_i, a_i)) = \sum_{i=1}^{6} p_i = 1.$$

The two experiments are totally coupled: one has outcome $a_i$ if and only if
the other has outcome $a_i$.

2.6 Now there are 10 outcomes in B (for example (0,1,0,1,0)), each having
probability $(1 - p)^3 p^2$. Hence $P(B) = 10(1 - p)^3 p^2$.

2.7 This happens if and only if the experiment fails on Monday, ..., Saturday,
and is a success on Sunday. This has probability $p(1 - p)^6$ to happen.


2.7  Exercises 

2.1 □ Let A and B be two events in a sample space for which P(A) = 2/3,
P(B) = 1/6, and $P(A \cap B) = 1/9$. What is $P(A \cup B)$?
2.2 Let E and F be two events for which one knows that the probability that
at least one of them occurs is 3/4. What is the probability that neither E nor
F occurs? Hint: use one of DeMorgan’s laws: $E^c \cap F^c = (E \cup F)^c$.

2.3 Let C and D be two events for which one knows that P(C) = 0.3, P(D) =
0.4, and $P(C \cap D) = 0.2$. What is $P(C^c \cap D)$?

2.4 □ We consider events A, B, and C, which can occur in some experiment.
Is it true that the probability that only A occurs (and not B or C) is equal
to $P(A \cup B \cup C) - P(B) - P(C) + P(B \cap C)$?

2.5 The event $A \cap B^c$ that A occurs but not B is sometimes denoted as $A \setminus B$.
Here $\setminus$ is the set-theoretic minus sign. Show that $P(A \setminus B) = P(A) - P(B)$ if
B implies A, i.e., if $B \subset A$.

2.6 When P(A) = 1/3, P(B) = 1/2, and $P(A \cup B) = 3/4$, what is

a. $P(A \cap B)$?

b. $P(A^c \cup B^c)$?

2.7 □ Let A and B be two events. Suppose that P(A) = 0.4, P(B) = 0.5, and
$P(A \cap B) = 0.1$. Find the probability that A or B occurs, but not both.

2.8 ⊞ Suppose the events $D_1$ and $D_2$ represent disasters, which are rare:
$P(D_1) \le 10^{-6}$ and $P(D_2) \le 10^{-6}$. What can you say about the probability
that  at  least  one  of  the  disasters  occurs?  What  about  the  probability  that 
they  both  occur? 

2.9  We  toss  a  coin  three  times.  For  this  experiment  we  choose  the  sample 
space 


$$\Omega = \{HHH, THH, HTH, HHT, TTH, THT, HTT, TTT\},$$
where T stands for tails and H for heads.

a.  Write  down  the  set  of  outcomes  corresponding  to  each  of  the  following 
events: 

A  :  “we  throw  tails  exactly  two  times.” 

B  :  “we  throw  tails  at  least  two  times.” 

C  :  “tails  did  not  appear  before  a  head  appeared.” 

D  :  “the  first  throw  results  in  tails.” 

b.  Write  down  the  set  of  outcomes  corresponding  to  each  of  the  following 
events: $A^c$, $A \cup (C \cap D)$, and $A \cap D^c$.

2.10  In  some  sample  space  we  consider  two  events  A  and  B.  Let  C  be  the 
event  that  A  or  B  occurs,  but  not  both.  Express  C  in  terms  of  A  and  B,  using 
only  the  basic  operations  “union,”  “intersection,”  and  “complement.” 




2.11 □ An experiment has only two outcomes. The first has probability p to
occur, the second probability $p^2$. What is p?

2.12 ⊞ In the UEFA Euro 2004 playoffs draw 10 national football teams
were matched in pairs. A lot of people complained that “the draw was not
fair,” because each strong team had been matched with a weak team (this
is commercially the most interesting). It was claimed that such a matching
is extremely unlikely. We will compute the probability of this “dream draw”
in this exercise. In the spirit of the three-envelope example of Section 2.1
we put the names of the 5 strong teams in envelopes labeled 1, 2, 3, 4, and
5 and of the 5 weak teams in envelopes labeled 6, 7, 8, 9, and 10. We shuffle
the 10 envelopes and then match the envelope on top with the next envelope,
the third envelope with the fourth envelope, and so on. One particular way
a “dream draw” occurs is when the five envelopes labeled 1, 2, 3, 4, 5 are in
the odd numbered positions (in any order!) and the others are in the even
numbered positions. This way corresponds to the situation where the first
match of each strong team is a home match. Since for each pair there are
two possibilities for the home match, the total number of possibilities for the
“dream draw” is $2^5 = 32$ times as large.

a. An outcome of this experiment is a sequence like 4, 9, 3, 7, 5, 10, 1, 8, 2, 6 of
labels of envelopes. What is the probability of an outcome?

b. How many outcomes are there in the event “the five envelopes labeled
1, 2, 3, 4, 5 are in the odd positions, in any order, and the envelopes
labeled 6, 7, 8, 9, 10 are in the even positions, in any order”?

c. What is the probability of a “dream draw”?

2.13 In some experiment first an arbitrary choice is made out of four
possibilities, and then an arbitrary choice is made out of the remaining three
possibilities. One way to describe this is with a product of two sample spaces
{a, b, c, d}:

$$\Omega = \{a, b, c, d\} \times \{a, b, c, d\}.$$

a.  Make  a  4x4  table  in  which  you  write  the  probabilities  of  the  outcomes. 

b.  Describe  the  event  “c  is  one  of  the  chosen  possibilities”  and  determine  its 
probability. 

2.14 ⊞ Consider the Monty Hall “experiment” described in Section 1.3. The
door behind which the car is parked we label a, the other two b and c. As the
sample space we choose a product space

$$\Omega = \{a, b, c\} \times \{a, b, c\}.$$

Here  the  first  entry  gives  the  choice  of  the  candidate,  and  the  second  entry 
the  choice  of  the  quizmaster. 
a. Make a 3x3 table in which you write the probabilities of the outcomes.
N.B.  You  should  realize  that  the  candidate  does  not  know  that  the  car 
is  in  a,  but  the  quizmaster  will  never  open  the  door  labeled  a  because  he 
knows  that  the  car  is  there.  You  may  assume  that  the  quizmaster  makes 
an  arbitrary  choice  between  the  doors  labeled  b  and  c,  when  the  candidate 
chooses  door  a. 

b.  Consider  the  situation  of  a  “no  switching”  candidate  who  will  stick  to  his 
or  her  choice.  What  is  the  event  “the  candidate  wins  the  car,”  and  what 
is  its  probability? 

c.  Consider  the  situation  of  a  “switching”  candidate  who  will  not  stick  to 
her  choice.  What  is  now  the  event  “the  candidate  wins  the  car,”  and  what 
is  its  probability? 

2.15 The rule $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ from Section 2.3 is often
useful to compute the probability of the union of two events. What would be
the corresponding rule for three events A, B, and C? It should start with

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - \cdots.$$

Hint:  you  could  use  the  sum  rule  suitably,  or  you  could  make  a  diagram  as  in 
Figure  2.1. 

2.16 ⊞ Three events E, F, and G cannot occur simultaneously. Further it
is known that $P(E \cap F) = P(F \cap G) = P(E \cap G) = 1/3$. Can you
determine P(E)?

Hint:  if  you  try  to  use  the  formula  of  Exercise  2.15  then  it  seems  that  you  do 
not  have  enough  information;  make  a  diagram  instead. 

2.17  A  post  office  has  two  counters  where  customers  can  buy  stamps,  etc. 
If  you  are  interested  in  the  number  of  customers  in  the  two  queues  that  will 
form  for  the  counters,  what  would  you  take  as  sample  space? 

2.18 In a laboratory, two experiments are repeated every day of the week in
different rooms until at least one is successful, the probability of success
being $p$ for each experiment. Supposing that the experiments in different rooms
and on different days are performed independently of each other, what is the
probability that the laboratory scores its first successful experiment on day $n$?

2.19 □ We repeatedly toss a coin. A head has probability $p$, and a tail
probability $1 - p$ to occur, where $0 < p < 1$. The outcome of the experiment we
are interested in is the number of tosses it takes until a head occurs for the
second time.

a.  What  would  you  choose  as  the  sample  space? 

b.  What  is  the  probability  that  it  takes  5  tosses? 


3 Conditional probability and independence

Knowing that an event has occurred sometimes forces us to reassess the
probability of another event; the new probability is the conditional probability.
If the conditional probability equals what the probability was before, the events
involved are called independent. Often, conditional probabilities and
independence are needed if we want to compute probabilities, and in many other
situations they simplify the work.


3.1  Conditional  probability 

In the previous chapter we encountered the events $L$, “born in a long month,”
and $R$, “born in a month with the letter r.” Their probabilities are easy to
compute: since L = {Jan, Mar, May, Jul, Aug, Oct, Dec} and R = {Jan, Feb,
Mar, Apr, Sep, Oct, Nov, Dec}, one finds

\[ P(L) = \frac{7}{12} \quad \text{and} \quad P(R) = \frac{8}{12}. \]

Now  suppose  that  it  is  known  about  the  person  we  meet  in  the  street  that 
he  was  born  in  a  “long  month,”  and  we  wonder  whether  he  was  born  in 
a  “month  with  the  letter  r.”  The  information  given  excludes  five  outcomes 
of  our  sample  space:  it  cannot  be  February,  April,  June,  September,  or 
November. Seven possible outcomes are left, of which only four (those in
R ∩ L = {Jan, Mar, Oct, Dec}) are favorable, so we reassess the probability
as 4/7. We call this the conditional probability of $R$ given $L$, and we write:

\[ P(R \mid L) = \frac{4}{7}. \]

This is not the same as $P(R \cap L)$, which is 1/3. Also note that $P(R \mid L)$ is the
proportion that $P(R \cap L)$ is of $P(L)$.
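
The counting argument above can be checked mechanically. The following Python
sketch assumes, as in the text, that all twelve months are equally likely; the
helper names `prob` and `cond_prob` are illustrative choices and do not come
from the book.

```python
from fractions import Fraction

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Events from the text: L = "long month", R = "month with the letter r".
L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}


def prob(event):
    """Probability of an event when all twelve months are equally likely."""
    return Fraction(len(event), len(MONTHS))


def cond_prob(a, c):
    """Proportion that P(a and c) is of P(c); requires P(c) > 0."""
    return prob(a & c) / prob(c)


print(prob(R & L))      # 1/3
print(cond_prob(R, L))  # 4/7
```

Exact fractions are used so that the output can be compared directly with the
values 1/3 and 4/7 found by counting.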

Quick exercise 3.1 Let $N = R^c$ be the event “born in a month without r.”
What is the conditional probability $P(N \mid L)$?

Recalling the three envelopes on our doormat, consider the events “envelope 1
is the middle one” (call this event $A$) and “envelope 2 is the middle one” ($B$).
Then $P(A) = P(213 \text{ or } 312) = 1/3$; by symmetry, the same is found for $P(B)$.
We say that the envelopes are in order if their order is either 123 or 321.
Suppose we know that they are not in order, but otherwise we do not know
anything; what are the probabilities of $A$ and $B$, given this information?

Let $C$ be the event that the envelopes are not in order, so: $C = \{123, 321\}^c =
\{132, 213, 231, 312\}$. We ask for the probabilities of $A$ and $B$, given that $C$
occurs. Event $C$ consists of four elements, two of which also belong to $A$:
$A \cap C = \{213, 312\}$, so $P(A \mid C) = 1/2$. The probability of $A \cap C$ is half of
$P(C)$. No element of $C$ also belongs to $B$, so $P(B \mid C) = 0$.
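
As a check on the envelope example, the six equally likely orders can be
enumerated directly. This is a minimal sketch; the names `OMEGA` and `cond_prob`
are illustrative and not taken from the book.

```python
from fractions import Fraction
from itertools import permutations

# Sample space: the six possible orders of envelopes 1, 2, 3, equally likely.
OMEGA = {"".join(p) for p in permutations("123")}

A = {w for w in OMEGA if w[1] == "1"}  # envelope 1 is the middle one
B = {w for w in OMEGA if w[1] == "2"}  # envelope 2 is the middle one
C = OMEGA - {"123", "321"}             # the envelopes are not in order


def cond_prob(a, c):
    """P(a | c) by counting: the outcomes of c that also lie in a."""
    return Fraction(len(a & c), len(c))


print(cond_prob(A, C))  # 1/2
print(cond_prob(B, C))  # 0
```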

Quick exercise 3.2 Calculate $P(C \mid A)$ and $P(C^c \mid A \cup B)$.

In  general,  computing  the  probability  of  an  event  A,  given  that  an  event  C 
occurs,  means  finding  which  fraction  of  the  probability  of  C  is  also  in  the 
event  A. 


Definition. The conditional probability of $A$ given $C$ is given by:

\[ P(A \mid C) = \frac{P(A \cap C)}{P(C)}, \]

provided $P(C) > 0$.


Quick exercise 3.3 Show that $P(A \mid C) + P(A^c \mid C) = 1$.


This exercise shows that the rule $P(A^c) = 1 - P(A)$ also holds for conditional
probabilities. In fact, even more is true: if we have a fixed conditioning event $C$
and define $Q(A) = P(A \mid C)$ for events $A \subset \Omega$, then $Q$ is a probability function
and  hence  satisfies  all  the  rules  as  described  in  Chapter  2.  The  definition  of 
conditional  probability  agrees  with  our  intuition  and  it  also  works  in  situations 
where  computing  probabilities  by  counting  outcomes  does  not. 


A  chemical  reactor:  residence  times 

Consider  a  continuously  stirred  reactor  vessel  where  a  chemical  reaction  takes 
place.  On  one  side  fluid  or  gas  flows  in,  mixes  with  whatever  is  already  present 
in  the  vessel,  and  eventually  flows  out  on  the  other  side.  One  characteristic 
of  each  particular  reaction  setup  is  the  so-called  residence  time  distribution, 
which  tells  us  how  long  particles  stay  inside  the  vessel  before  moving  on.  We 
consider  a  continuously  stirred  tank:  the  contents  of  the  vessel  are  perfectly 
mixed  at  all  times. 

Let $R_t$ denote the event “the particle has a residence time longer than $t$
seconds.” In Section 5.3 we will see how continuous stirring determines the
probabilities; here we just use that in a particular continuously stirred tank,
$R_t$ has probability $e^{-t}$. So:

\[ P(R_3) = e^{-3} = 0.04978\ldots \]

\[ P(R_4) = e^{-4} = 0.01831\ldots . \]


We  can  use  the  definition  of  conditional  probability  to  find  the  probability 
that  a  particle  that  has  stayed  more  than  3  seconds  will  stay  more  than  4: 


\[ P(R_4 \mid R_3) = \frac{P(R_4 \cap R_3)}{P(R_3)} = \frac{P(R_4)}{P(R_3)} = \frac{e^{-4}}{e^{-3}} = e^{-1} = 0.36787\ldots . \]
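
The same calculation can be sketched in Python. The function names are
hypothetical, and the model $P(R_t) = e^{-t}$ is the one assumed in the text for
this particular tank.

```python
import math


def residence_prob(t):
    """P(R_t): probability of a residence time longer than t seconds,
    under the text's assumption P(R_t) = exp(-t)."""
    return math.exp(-t)


def cond_longer(t_long, t_short):
    """P(R_t_long | R_t_short) for t_long >= t_short.

    Staying longer than t_long implies staying longer than t_short, so the
    intersection of the two events is R_t_long itself.
    """
    if t_long < t_short:
        raise ValueError("expected t_long >= t_short")
    return residence_prob(t_long) / residence_prob(t_short)


print(round(cond_longer(4, 3), 5))  # 0.36788 (the text truncates to 0.36787...)
```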


Quick exercise 3.4 Calculate $P(R_3 \mid R_4^c)$.


For  more  details  on  the  subject  of  residence  time  distributions  see,  for  example, 
the  book  on  reaction  engineering  by  Fogler  ([11]). 


3.2  The  multiplication  rule 

From the definition of conditional probability we derive a useful rule by
multiplying left and right by $P(C)$.


The multiplication rule. For any events $A$ and $C$:

\[ P(A \cap C) = P(A \mid C) \cdot P(C). \]


Computing the probability of $A \cap C$ can hence be decomposed into two parts,
computing $P(C)$ and $P(A \mid C)$ separately, which is often easier than computing
$P(A \cap C)$ directly.


The  probability  of  no  coincident  birthdays 


Suppose you meet two arbitrarily chosen people. What is the probability their
birthdays are different? Let $B_2$ denote the event that this happens. Whatever
the birthday of the first person is, there is only one day the second person
cannot “pick” as birthday, so:

\[ P(B_2) = 1 - \frac{1}{365}. \]


When the same question is asked with three people, conditional probabilities
become helpful. The event $B_3$ that all three birthdays differ can be seen as
the intersection of the event $B_2$, “the first two have different birthdays,”
with event $A_3$, “the third person has a birthday that does not coincide with
that of one of the first two persons.” Using the multiplication rule:

\[ P(B_3) = P(A_3 \cap B_2) = P(A_3 \mid B_2) P(B_2). \]

The conditional probability $P(A_3 \mid B_2)$ is the probability that, when two days
are already marked on the calendar, a day picked at random is not marked,
or

\[ P(A_3 \mid B_2) = 1 - \frac{2}{365}, \]

and so

\[ P(B_3) = P(A_3 \mid B_2) P(B_2) = \left(1 - \frac{2}{365}\right) \left(1 - \frac{1}{365}\right) = 0.9918. \]


We are already halfway to solving the general question: in a group of $n$
arbitrarily chosen people, what is the probability there are no coincident
birthdays? The event $B_n$ of no coincident birthdays among the $n$ persons is the
same as: “the birthdays of the first $n - 1$ persons are different” (the event
$B_{n-1}$) and “the birthday of the $n$th person does not coincide with a birthday
of any of the first $n - 1$ persons” (the event $A_n$), that is,

\[ B_n = A_n \cap B_{n-1}. \]

Applying the multiplication rule yields:

\[ P(B_n) = P(A_n \mid B_{n-1}) \cdot P(B_{n-1}) = \left(1 - \frac{n-1}{365}\right) \cdot P(B_{n-1}) \]

as person $n$ should avoid $n - 1$ days. Applying the same step to $P(B_{n-1})$,
$P(B_{n-2})$, etc., we find:


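\[
\begin{aligned}
P(B_n) &= \left(1 - \frac{n-1}{365}\right) \cdot P(A_{n-1} \mid B_{n-2}) \cdot P(B_{n-2}) \\
&= \left(1 - \frac{n-1}{365}\right) \cdot \left(1 - \frac{n-2}{365}\right) \cdot P(B_{n-2}) \\
&\;\;\vdots \\
&= \left(1 - \frac{n-1}{365}\right) \cdots \left(1 - \frac{2}{365}\right) \cdot P(B_2) \\
&= \left(1 - \frac{n-1}{365}\right) \cdots \left(1 - \frac{2}{365}\right) \cdot \left(1 - \frac{1}{365}\right).
\end{aligned}
\]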

This can be used to compute the probability for arbitrary $n$. For example,
we find: $P(B_{22}) = 0.5243$ and $P(B_{23}) = 0.4927$. In Figure 3.1 the probability
$P(B_n)$ is plotted for $n = 1, \ldots, 100$, with dotted lines drawn at $n = 23$ and
at probability 0.5. It may be hard to believe, but with just 23 people the
probability of all birthdays being different is less than 50%!

Fig. 3.1. The probability $P(B_n)$ of no coincident birthdays for $n = 1, \ldots, 100$.
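
The product formula for the probability of no coincident birthdays is easy to
evaluate numerically. The sketch below is one possible implementation; the
function name and the `days` parameter are illustrative, and, as in the text,
all days of the year are assumed equally likely.

```python
def prob_no_shared_birthday(n, days=365):
    """Probability that n arbitrarily chosen people all have different
    birthdays: the product of (1 - k/days) for k = 1, ..., n - 1."""
    p = 1.0
    for k in range(1, n):
        p *= 1 - k / days
    return p


print(round(prob_no_shared_birthday(22), 4))  # 0.5243
print(round(prob_no_shared_birthday(23), 4))  # 0.4927

# Smallest group size for which the probability drops below one half.
n = 1
while prob_no_shared_birthday(n) >= 0.5:
    n += 1
print(n)  # 23
```

Passing `days=12` gives the analogous probability for months of birth.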

Quick  exercise  3.5  Compute  the  probability  that  three  arbitrary  people  are 
born  in  different  months.  Can  you  give  the  formula  for  n  people? 

It  matters  how  one  conditions 

Conditioning can help to make computations easier, but it matters how it is
applied. To compute $P(A \cap C)$ we may condition on $C$ to get

\[ P(A \cap C) = P(A \mid C) \cdot P(C); \]

or we may condition on $A$ and get

\[ P(A \cap C) = P(C \mid A) \cdot P(A). \]

Both ways are valid, but often one of $P(A \mid C)$ and $P(C \mid A)$ is easy and the
other is not. For example, in the birthday example one could have tried:

\[ P(B_3) = P(A_3 \cap B_2) = P(B_2 \mid A_3) P(A_3), \]

but just trying to understand the conditional probability $P(B_2 \mid A_3)$ already
is confusing:

The  probability  that  the  first  two  persons’  birthdays  differ  given  that 
the  third  person’s  birthday  does  not  coincide  with  the  birthday  of  one 
of  the  first  two  . . .  ? 

Conditioning  should  lead  to  easier  probabilities;  if  not,  it  is  probably  the 
wrong  approach. 


3.3 The law of total probability and Bayes’ rule

We  will  now  discuss  two  important  rules  that  help  probability  computations 
by  means  of  conditional  probabilities.  We  introduce  both  of  them  in  the  next 
example. 

Testing  for  mad  cow  disease 

In early 2001 the European Commission introduced massive testing of cattle
to determine infection with the transmissible form of Bovine Spongiform
Encephalopathy (BSE) or “mad cow disease.” As no test is 100% accurate, most
tests  have  the  problem  of  false  positives  and  false  negatives.  A  false  positive 
means  that  according  to  the  test  the  cow  is  infected,  but  in  actuality  it  is  not. 
A  false  negative  means  an  infected  cow  is  not  detected  by  the  test. 

Imagine  we  test  a  cow.  Let  B  denote  the  event  “the  cow  has  BSE”  and  T 
the  event  “the  test  comes  up  positive”  (this  is  test  jargon  for:  according  to 
the  test  we  should  believe  the  cow  is  infected  with  BSE).  One  can  “test  the 
test”  by  analyzing  samples  from  cows  that  are  known  to  be  infected  or  known 
to  be  healthy  and  so  determine  the  effectiveness  of  the  test.  The  European 
Commission  had  this  done  for  four  tests  in  1999  (see  [19])  and  for  several 
more  later.  The  results  for  what  the  report  calls  Test  A  may  be  summarized 
as  follows:  an  infected  cow  has  a  70%  chance  of  testing  positive,  and  a  healthy 
cow  just  10%;  in  formulas: 


\[ P(T \mid B) = 0.70, \]

\[ P(T \mid B^c) = 0.10. \]

Suppose we want to determine the probability $P(T)$ that an arbitrary cow
tests positive. The tested cow is either infected or it is not: event $T$ occurs in
combination with $B$ or with $B^c$ (there are no other possibilities). In terms of
events

\[ T = (T \cap B) \cup (T \cap B^c), \]

so that

\[ P(T) = P(T \cap B) + P(T \cap B^c), \]

because $T \cap B$ and $T \cap B^c$ are disjoint. Next, apply the multiplication rule (in
such a way that the known conditional probabilities appear!):

\[
\begin{aligned}
P(T \cap B) &= P(T \mid B) \cdot P(B), \\
P(T \cap B^c) &= P(T \mid B^c) \cdot P(B^c),
\end{aligned}
\tag{3.1}
\]

so that

\[ P(T) = P(T \mid B) \cdot P(B) + P(T \mid B^c) \cdot P(B^c). \tag{3.2} \]

This is an application of the law of total probability: computing a probability
through conditioning on several disjoint events that make up the whole sample
space (in this case two). Suppose¹ $P(B) = 0.02$; then from the last equation
we conclude: $P(T) = 0.02 \cdot 0.70 + (1 - 0.02) \cdot 0.10 = 0.112$.

Quick exercise 3.6 Calculate $P(T)$ when $P(T \mid B) = 0.99$ and $P(T \mid B^c) =
0.05$.

Following  is  a  general  statement  of  the  law. 


The law of total probability. Suppose $C_1, C_2, \ldots, C_m$ are
disjoint events such that $C_1 \cup C_2 \cup \cdots \cup C_m = \Omega$. The probability of
an arbitrary event $A$ can be expressed as:

\[ P(A) = P(A \mid C_1) P(C_1) + P(A \mid C_2) P(C_2) + \cdots + P(A \mid C_m) P(C_m). \]


Figure 3.2 illustrates the law for $m = 5$. The event $A$ is the disjoint union of
$A \cap C_i$, for $i = 1, \ldots, 5$, so $P(A) = P(A \cap C_1) + \cdots + P(A \cap C_5)$, and for each $i$
the multiplication rule states $P(A \cap C_i) = P(A \mid C_i) \cdot P(C_i)$.


Fig.  3.2.  The  law  of  total  probability  (illustration  for  m  =  5). 


In the BSE example, we have just two mutually exclusive events: substitute
$m = 2$, $C_1 = B$, $C_2 = B^c$, and $A = T$ to obtain (3.2).
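
The law of total probability translates directly into a short function. This is
only a sketch: the name `total_probability` and its list-based interface are
assumptions made for illustration. It is applied here to the BSE example with
the illustrative value 0.02 for the probability that a cow is infected.

```python
def total_probability(priors, likelihoods):
    """P(A) = sum over i of P(A | C_i) * P(C_i).

    priors[i] is P(C_i) for disjoint events C_1, ..., C_m that together make
    up the whole sample space; likelihoods[i] is P(A | C_i).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("the probabilities P(C_i) must sum to 1")
    return sum(l * p for l, p in zip(likelihoods, priors))


# BSE example: C_1 = B (infected), C_2 = B^c (healthy), A = T (positive test).
p_B = 0.02
p_T = total_probability([p_B, 1 - p_B], [0.70, 0.10])
print(round(p_T, 3))  # 0.112
```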

Another,  perhaps  more  pertinent,  question  about  the  BSE  test  is  the  following: 
suppose  my  cow  tests  positive;  what  is  the  probability  it  really  has  BSE? 
Translated,  this  asks  for  the  value  of  P(B  |  T).  The  information  we  were  given 
is  P(T|  B),  a  conditional  probability,  but  the  wrong  one.  We  would  like  to 
switch  T  and  B. 

Start with the definition of conditional probability and then use equations
(3.1) and (3.2):

\[ P(B \mid T) = \frac{P(T \cap B)}{P(T)} = \frac{P(T \mid B) \cdot P(B)}{P(T \mid B) \cdot P(B) + P(T \mid B^c) \cdot P(B^c)}. \]

¹ We choose this probability for the sake of the calculations that follow. The true
value is unknown and varies from country to country. The BSE risk for the
Netherlands for 2003 was estimated to be $P(B) \approx 0.000013$.


So with $P(B) = 0.02$ we find

\[ P(B \mid T) = \frac{0.70 \cdot 0.02}{0.70 \cdot 0.02 + 0.10 \cdot (1 - 0.02)} = 0.125, \]

and by a similar calculation: $P(B \mid T^c) = 0.0068$. These probabilities reflect
that this Test A is not a very good test; a perfect test would result in
$P(B \mid T) = 1$ and $P(B \mid T^c) = 0$. In Exercise 3.4 we redo this calculation,
replacing $P(B) = 0.02$ with a more realistic number.
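
The two posterior probabilities for Test A can be reproduced with a few lines
of Python. The function name `posterior` and its parameter names are
illustrative, not from the book; the inputs are the values 0.02, 0.70, and 0.10
used in the example.

```python
def posterior(prior, p_pos_given_infected, p_pos_given_healthy):
    """Return (P(B | T), P(B | T^c)) for a test with the given properties.

    prior is P(B); the other two arguments are P(T | B) and P(T | B^c).
    """
    # Law of total probability for the denominator P(T).
    p_T = p_pos_given_infected * prior + p_pos_given_healthy * (1 - prior)
    # Definition of conditional probability plus the multiplication rule.
    p_B_given_T = p_pos_given_infected * prior / p_T
    # Same argument with T^c, using P(T^c | B) = 1 - P(T | B).
    p_B_given_not_T = (1 - p_pos_given_infected) * prior / (1 - p_T)
    return p_B_given_T, p_B_given_not_T


pos, neg = posterior(0.02, 0.70, 0.10)
print(round(pos, 3))  # 0.125
print(round(neg, 4))  # 0.0068
```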

What  we  have  just  seen  is  known  as  Bayes’  rule,  after  the  English  clergyman 
Thomas  Bayes  who  derived  this  in  the  18th  century.  The  general  statement 
follows. 


Bayes’ rule. Suppose the events $C_1, C_2, \ldots, C_m$ are disjoint and
$C_1 \cup C_2 \cup \cdots \cup C_m = \Omega$. The conditional probability of $C_i$, given an
arbitrary event $A$, can be expressed as:

\[ P(C_i \mid A) = \frac{P(A \mid C_i) \cdot P(C_i)}{P(A \mid C_1) P(C_1) + P(A \mid C_2) P(C_2) + \cdots + P(A \mid C_m) P(C_m)}. \]


This is the traditional form of Bayes’ formula. It follows from

\[ P(C_i \mid A) = \frac{P(A \mid C_i) \cdot P(C_i)}{P(A)} \tag{3.3} \]

in combination with the law of total probability applied to $P(A)$ in the
denominator. Purists would refer to (3.3) as Bayes’ rule, and perhaps they are
right.


Quick exercise 3.7 Calculate $P(B \mid T)$ and $P(B \mid T^c)$ if $P(T \mid B) = 0.99$ and
$P(T \mid B^c) = 0.05$.


3.4  Independence 

Consider  three  probabilities  from  the  previous  section: 

\[ P(B) = 0.02, \]

\[ P(B \mid T) = 0.125, \]

\[ P(B \mid T^c) = 0.0068. \]

If we know nothing about a cow, we would say that there is a 2% chance it is
infected. However, if we know it tested positive, we can say there is a 12.5%
chance the cow is infected. On the other hand, if it tested negative, there is
only a 0.68% chance. We see that the two events are related in some way: the
probability of $B$ depends on whether $T$ occurs.

Imagine the opposite: the test is useless. Whether the cow is infected is
unrelated to the outcome of the test, and knowing the outcome of the test does not
change our probability of $B$: $P(B \mid T) = P(B)$. In this case we would call $B$
independent of $T$.


Definition.  An  event  A  is  called  independent  of  B  if 

P(A|B)  =P(A). 


From this simple definition many statements can be derived. For example,
because $P(A^c \mid B) = 1 - P(A \mid B)$ and $1 - P(A) = P(A^c)$, we conclude:

\[ A \text{ independent of } B \iff A^c \text{ independent of } B. \tag{3.4} \]

By application of the multiplication rule, if $A$ is independent of $B$, then
$P(A \cap B) = P(A \mid B) P(B) = P(A) P(B)$. On the other hand, if $P(A \cap B) =
P(A) P(B)$, then $P(A \mid B) = P(A)$ follows from the definition of independence.
This shows:


\[ A \text{ independent of } B \iff P(A \cap B) = P(A) P(B). \]

Finally, by definition of conditional probability, if $A$ is independent of $B$, then

\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A) \cdot P(B)}{P(A)} = P(B), \]

that is, $B$ is independent of $A$. This works in reverse, too, so we have:

\[ A \text{ independent of } B \iff B \text{ independent of } A. \tag{3.5} \]


This statement says that in fact, independence is a mutual property. Therefore,
the expressions “$A$ is independent of $B$” and “$A$ and $B$ are independent” are
used interchangeably. From the three $\iff$-statements it follows that there are
in fact 12 ways to show that $A$ and $B$ are independent; and if they are, there
are 12 ways to use that.


Independence. To show that $A$ and $B$ are independent it suffices
to prove just one of the following:

\[ P(A \mid B) = P(A), \]

\[ P(B \mid A) = P(B), \]

\[ P(A \cap B) = P(A) P(B), \]

where $A$ may be replaced by $A^c$ and $B$ replaced by $B^c$, or both. If
one of these statements holds, all of them are true. If two events are
not independent, they are called dependent.

Recall the birthday events $L$ “born in a long month” and $R$ “born in a month
with the letter r.” Let $H$ be the event “born in the first half of the year,”
so $P(H) = 1/2$. Also, $P(H \mid R) = 1/2$. So $H$ and $R$ are independent, and we
conclude, for example, $P(R^c \mid H^c) = P(R^c) = 1 - 8/12 = 1/3$.

We know that $P(L \cap H) = 1/4$ and $P(L) = 7/12$. Checking $1/2 \times 7/12 \neq 1/4$,
you conclude that $L$ and $H$ are dependent.

Quick exercise 3.8 Derive the statement “$R^c$ is independent of $H^c$” from
“$H$ is independent of $R$” using rules (3.4) and (3.5).
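
Both conclusions can be verified with the product criterion
$P(A \cap B) = P(A) P(B)$. The sketch below again assumes twelve equally likely
months; `prob` and `independent` are illustrative helper names.

```python
from fractions import Fraction

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}         # long month
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}  # letter r
H = set(MONTHS[:6])                                           # first half of the year


def prob(event):
    """Probability of an event when all twelve months are equally likely."""
    return Fraction(len(event), len(MONTHS))


def independent(a, b):
    """Product criterion: a and b are independent iff P(a and b) = P(a) P(b)."""
    return prob(a & b) == prob(a) * prob(b)


print(independent(H, R))  # True:  1/3 equals 1/2 * 2/3
print(independent(L, H))  # False: 1/4 differs from 7/12 * 1/2
```

Using exact fractions avoids the rounding issues that a floating-point
comparison with `==` could introduce.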

Since the words dependence and independence have several meanings, one
sometimes uses the terms stochastic or statistical dependence and
independence to avoid ambiguity.

Remark  3.1  (Physical  and  stochastic  independence).  Stochastic 
dependence  or  independence  can  sometimes  be  established  by  inspecting 
whether  there  is  any  physical  dependence  present.  The  following  statements 
may  be  made. 

If events have to do with processes or experiments that have no physical
connection, they are always stochastically independent. If they are connected
to the same physical process, then, as a rule, they are stochastically
dependent, but stochastic independence is possible in exceptional cases. The
events $H$ and $R$ are an example.

Independence  of  two  or  more  events 

When  more  than  two  events  are  involved  we  need  a  more  elaborate  definition 
of  independence.  The  reason  behind  this  is  explained  by  an  example  following 
the  definition. 


Independence of two or more events. Events $A_1, A_2, \ldots,
A_m$ are called independent if

\[ P(A_1 \cap A_2 \cap \cdots \cap A_m) = P(A_1) P(A_2) \cdots P(A_m) \]

and this statement also holds when any number of the events $A_1,
\ldots, A_m$ are replaced by their complements throughout the formula.


You see that we need to check $2^m$ equations to establish the independence of
$m$ events. In fact, $m + 1$ of those equations are redundant, but we chose this
version of the definition because it is easier.

The  reason  we  need  to  do  so  much  more  checking  to  establish  independence 
for  multiple  events  is  that  there  are  subtle  ways  in  which  events  may  depend 
on  each  other.  Consider  the  question: 

Is independence for three events $A$, $B$, and $C$ the same as: $A$ and $B$ are
independent; $B$ and $C$ are independent; and $A$ and $C$ are independent?

The answer is “No,” as the following example shows. Perform two independent
tosses of a coin. Let $A$ be the event “heads on toss 1,” $B$ the event “heads on
toss 2,” and $C$ “the two tosses are equal.”

First, get the probabilities. Of course, $P(A) = P(B) = 1/2$, but also

\[ P(C) = P(A \cap B) + P(A^c \cap B^c) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}. \]

What about independence? Events $A$ and $B$ are independent by assumption,
so check the independence of $A$ and $C$. Given that the first toss is heads ($A$
occurs), $C$ occurs if and only if the second toss is heads as well ($B$ occurs), so

\[ P(C \mid A) = P(B \mid A) = P(B) = \frac{1}{2} = P(C). \]

By symmetry, also $P(C \mid B) = P(C)$, so all pairs taken from $A$, $B$, $C$ are
independent: the three are called pairwise independent. Checking the full
conditions for independence, we find, for example:

\[ P(A \cap B \cap C) = P(A \cap B) = \frac{1}{4}, \quad \text{whereas} \quad P(A) P(B) P(C) = \frac{1}{8}, \]

and

\[ P(A \cap B \cap C^c) = P(\emptyset) = 0, \quad \text{whereas} \quad P(A) P(B) P(C^c) = \frac{1}{8}. \]

The reason for this is clear: whether $C$ occurs follows deterministically from
the outcomes of tosses 1 and 2.
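
The difference between pairwise independence and independence of all three
events can be confirmed by enumerating the four outcomes of two tosses. This
sketch assumes a fair coin, as the probabilities 1/2 and 1/4 in the example do;
the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes of two independent tosses of a fair coin.
OMEGA = set(product("HT", repeat=2))

A = {w for w in OMEGA if w[0] == "H"}   # heads on toss 1
B = {w for w in OMEGA if w[1] == "H"}   # heads on toss 2
C = {w for w in OMEGA if w[0] == w[1]}  # the two tosses are equal


def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(OMEGA))


# Every pair satisfies the product rule ...
pairwise = all(prob(x & y) == prob(x) * prob(y)
               for x, y in [(A, B), (A, C), (B, C)])
# ... but the product rule fails for the triple intersection.
triple = prob(A & B & C) == prob(A) * prob(B) * prob(C)

print(pairwise, triple)                            # True False
print(prob(A & B & C), prob(A) * prob(B) * prob(C))  # 1/4 1/8
```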


3.5  Solutions  to  the  quick  exercises 


3.1 N = {May, Jun, Jul, Aug}, L = {Jan, Mar, May, Jul, Aug, Oct, Dec},
and N ∩ L = {May, Jul, Aug}. Three out of seven outcomes of $L$ belong to
$N$ as well, so $P(N \mid L) = 3/7$.

3.2 The event $A$ is contained in $C$. So when $A$ occurs, $C$ also occurs; therefore
$P(C \mid A) = 1$.

Since $C^c = \{123, 321\}$ and $A \cup B = \{123, 321, 312, 213\}$, one can see that two
of the four outcomes of $A \cup B$ belong to $C^c$ as well, so $P(C^c \mid A \cup B) = 1/2$.


3.3 Using the definition we find:

\[ P(A \mid C) + P(A^c \mid C) = \frac{P(A \cap C)}{P(C)} + \frac{P(A^c \cap C)}{P(C)} = 1, \]

because $C$ can be split into disjoint parts $A \cap C$ and $A^c \cap C$ and therefore

\[ P(A \cap C) + P(A^c \cap C) = P(C). \]

3.4 This asks for the probability that the particle stays more than 3 seconds,
given that it does not stay longer than 4 seconds, so 4 or less. From the
definition:

\[ P(R_3 \mid R_4^c) = \frac{P(R_3 \cap R_4^c)}{P(R_4^c)}. \]

The event $R_3 \cap R_4^c$ describes: longer than 3 but not longer than 4 seconds.
Furthermore, $R_3$ is the disjoint union of the events $R_3 \cap R_4^c$ and $R_3 \cap R_4 = R_4$,
so $P(R_3 \cap R_4^c) = P(R_3) - P(R_4) = e^{-3} - e^{-4}$. Using the complement rule:
$P(R_4^c) = 1 - P(R_4) = 1 - e^{-4}$. Together:

\[ P(R_3 \mid R_4^c) = \frac{e^{-3} - e^{-4}}{1 - e^{-4}} = \frac{0.0315}{0.9817} = 0.0321. \]


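3.5 Instead of a calendar of 365 days, we have one with just 12 months. Let
$C_n$ be the event that $n$ arbitrary persons have different months of birth. Then

\[ P(C_3) = \left(1 - \frac{2}{12}\right) \left(1 - \frac{1}{12}\right) = \frac{55}{72} = 0.7639 \]

and it is no surprise that this is much smaller than $P(B_3)$. The general formula
is

\[ P(C_n) = \left(1 - \frac{n-1}{12}\right) \cdots \left(1 - \frac{2}{12}\right) \left(1 - \frac{1}{12}\right). \]

Note that it is correct even if $n$ is 13 or more, in which case $P(C_n) = 0$.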


3.6 Repeating the calculation we find:

\[
\begin{aligned}
P(T \cap B) &= 0.99 \cdot 0.02 = 0.0198, \\
P(T \cap B^c) &= 0.05 \cdot 0.98 = 0.0490,
\end{aligned}
\]

so $P(T) = P(T \cap B) + P(T \cap B^c) = 0.0198 + 0.0490 = 0.0688$.


3.7 In the solution to Quick exercise 3.6 we already found $P(T \cap B) = 0.0198$
and $P(T) = 0.0688$, so

\[ P(B \mid T) = \frac{P(T \cap B)}{P(T)} = \frac{0.0198}{0.0688} = 0.2878. \]

Further, $P(T^c) = 1 - 0.0688 = 0.9312$ and $P(T^c \mid B) = 1 - P(T \mid B) = 0.01$.
So, $P(B \cap T^c) = 0.01 \cdot 0.02 = 0.0002$ and

\[ P(B \mid T^c) = \frac{0.0002}{0.9312} = 0.00021. \]


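3.8 It takes three steps of applying (3.4) and (3.5):

\[
\begin{array}{lcll}
H \text{ independent of } R & \iff & H^c \text{ independent of } R & \text{by (3.4)} \\
H^c \text{ independent of } R & \iff & R \text{ independent of } H^c & \text{by (3.5)} \\
R \text{ independent of } H^c & \iff & R^c \text{ independent of } H^c & \text{by (3.4)}
\end{array}
\]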


3.6  Exercises 



3.1 ⊞ Your lecturer wants to walk from A to B (see the map). To do so, he
first  randomly  selects  one  of  the  paths  to  C,  D,  or  E.  Next  he  selects  randomly 
one  of  the  possible  paths  at  that  moment  (so  if  he  first  selected  the  path  to 
E,  he  can  either  select  the  path  to  A  or  the  path  to  F),  etc.  What  is  the 
probability  that  he  will  reach  B  after  two  selections? 


3.2 ⊞ A fair die is thrown twice. $A$ is the event “sum of the throws equals 4,”
B  is  “at  least  one  of  the  throws  is  a  3.” 

a.  Calculate  P(A  |  B). 

b.  Are  A  and  B  independent  events? 

3.3 ⊞ We draw two cards from a regular deck of 52. Let $S_1$ be the event “the
first one is a spade,” and $S_2$ “the second one is a spade.”

a. Compute $P(S_1)$, $P(S_2 \mid S_1)$, and $P(S_2 \mid S_1^c)$.

b. Compute $P(S_2)$ by conditioning on whether the first card is a spade.

3.4 □ A Dutch cow is tested for BSE, using Test A as described in Section 3.3,
with $P(T \mid B) = 0.70$ and $P(T \mid B^c) = 0.10$. Assume that the BSE risk for the
Netherlands is the same as in 2003, when it was estimated to be $P(B) =
1.3 \cdot 10^{-5}$. Compute $P(B \mid T)$ and $P(B \mid T^c)$.

3.5  A  ball  is  drawn  at  random  from  an  urn  containing  one  red  and  one  white 
ball.  If  the  white  ball  is  drawn,  it  is  put  back  into  the  urn.  If  the  red  ball 
is  drawn,  it  is  returned  to  the  urn  together  with  two  more  red  balls.  Then  a 
second  draw  is  made.  What  is  the  probability  a  red  ball  was  drawn  on  both 
the  first  and  the  second  draws? 

3.6  We  choose  a  month  of  the  year,  in  such  a  manner  that  each  month  has 
the  same  probability.  Find  out  whether  the  following  events  are  independent: 

a.  the  events  “outcome  is  an  even  numbered  month”  (i.e.,  February,  April, 
June,  etc.)  and  “outcome  is  in  the  first  half  of  the  year.” 

b.  the  events  “outcome  is  an  even  numbered  month”  (i.e.,  February,  April, 
June,  etc.)  and  “outcome  is  a  summer  month”  (i.e.,  June,  July,  August). 


3.7 ⊞ Calculate

a. $P(A \cup B)$ if it is given that $P(A) = 1/3$ and $P(B \mid A^c) = 1/4$.

b. $P(B)$ if it is given that $P(A \cup B) = 2/3$ and $P(A^c \mid B^c) = 1/2$.

3.8 ⊞ Spaceman Spiff’s spacecraft has a warning light that is supposed to
switch on when the freem blasters are overheated. Let $W$ be the event “the
warning light is switched on” and $F$ “the freem blasters are overheated.”
Suppose the probability of freem blaster overheating $P(F)$ is 0.1, that the
light is switched on when they actually are overheated is 0.99, and that there
is a 2% chance that it comes on when nothing is wrong: $P(W \mid F^c) = 0.02$.

a.  Determine  the  probability  that  the  warning  light  is  switched  on. 

b. Determine the conditional probability that the freem blasters are
overheated, given that the warning light is on.

3.9 □ A certain grapefruit variety is grown in two regions in southern Spain.
Both areas get infested from time to time with parasites that damage the
crop. Let $A$ be the event that region $R_1$ is infested with parasites and $B$ that
region $R_2$ is infested. Suppose $P(A) = 3/4$, $P(B) = 2/5$ and $P(A \cup B) = 4/5$.
If the food inspection detects the parasite in a ship carrying grapefruits from
$R_1$, what is the probability region $R_2$ is infested as well?

3.10  A  student  takes  a  multiple-choice  exam.  Suppose  for  each  question  he 
either  knows  the  answer  or  gambles  and  chooses  an  option  at  random.  Further 
suppose  that  if  he  knows  the  answer,  the  probability  of  a  correct  answer  is  1, 
and  if  he  gambles  this  probability  is  1/4.  To  pass,  students  need  to  answer  at 
least  60%  of  the  questions  correctly.  The  student  has  “studied  for  a  minimal 
pass,”  i.e.,  with  probability  0.6  he  knows  the  answer  to  a  question.  Given  that 
he  answers  a  question  correctly,  what  is  the  probability  that  he  actually  knows 
the  answer? 

3.11  A  breath  analyzer,  used  by  the  police  to  test  whether  drivers  exceed 
the  legal  limit  set  for  the  blood  alcohol  percentage  while  driving,  is  known  to 
satisfy 

\[ P(A \mid B) = P(A^c \mid B^c) = p, \]

where  A  is  the  event  “breath  analyzer  indicates  that  legal  limit  is  exceeded” 
and  B  “driver’s  blood  alcohol  percentage  exceeds  legal  limit.”  On  Saturday 
night  about  5%  of  the  drivers  are  known  to  exceed  the  limit. 

a. Describe in words the meaning of $P(B^c \mid A)$.

b. Determine $P(B^c \mid A)$ if $p = 0.95$.

c. How big should $p$ be so that $P(B \mid A) = 0.9$?


3.12 The events $A$, $B$, and $C$ satisfy: $P(A \mid B \cap C) = 1/4$, $P(B \mid C) = 1/3$,
and $P(C) = 1/2$. Calculate $P(A^c \cap B \cap C)$.


3.13 In Exercise 2.12 we computed the probability of a “dream draw” in the
UEFA  playoffs  lottery  by  counting  outcomes.  Recall  that  there  were  ten  teams 
in  the  lottery,  five  considered  “strong”  and  five  considered  “weak.”  Introduce 
events $H_i$, “the $i$th pair drawn is a dream combination,” where a “dream
combination” is a pair of a strong team with a weak team, and $i = 1, \ldots, 5$.

a. Compute $P(H_1)$.

b. Compute $P(H_2 \mid H_1)$ and $P(H_1 \cap H_2)$.

c. Compute $P(H_3 \mid H_1 \cap H_2)$ and $P(H_1 \cap H_2 \cap H_3)$.

d. Continue the procedure to obtain the probability of a “dream draw”:
$P(H_1 \cap \cdots \cap H_5)$.

3.14  Recall  the  Monty  Hall  problem  from  Section  1.3.  Let  R  be  the  event 
“the  prize  is  behind  the  door  you  chose  initially,”  and  W  the  event  “you  win 
the  prize  by  switching  doors.” 

a. Compute $P(W \mid R)$ and $P(W \mid R^c)$.

b. Compute $P(W)$ using the law of total probability.

3.15 Two independent events A and B are given, and $P(B \mid A \cup B) = 2/3$,
$P(A \mid B) = 1/2$. What is $P(B)$?

3.16  You  are  diagnosed  with  an  uncommon  disease.  You  know  that  there 
only  is  a  1%  chance  of  getting  it.  Use  the  letter  H  for  the  event  “you  have  the 
disease”  and  T  for  “the  test  says  so.”  It  is  known  that  the  test  is  imperfect: 
$P(T \mid H) = 0.98$ and $P(T^c \mid H^c) = 0.95$.

a.  Given  that  you  test  positive,  what  is  the  probability  that  you  really  have 
the  disease? 

b.  You  obtain  a  second  opinion:  an  independent  repetition  of  the  test.  You 
test  positive  again.  Given  this,  what  is  the  probability  that  you  really  have 
the  disease? 

3.17  You  and  I  play  a  tennis  match.  It  is  deuce,  which  means  if  you  win  the 
next  two  rallies,  you  win  the  game;  if  I  win  both  rallies,  I  win  the  game;  if 
we  each  win  one  rally,  it  is  deuce  again.  Suppose  the  outcome  of  a  rally  is 
independent  of  other  rallies,  and  you  win  a  rally  with  probability  p.  Let  W  be 
the  event  “you  win  the  game,”  G  “the  game  ends  after  the  next  two  rallies,” 
and D “it becomes deuce again.”

a. Determine $P(W \mid G)$.

b. Show that $P(W) = p^2 + 2p(1 - p)\,P(W \mid D)$ and use $P(W) = P(W \mid D)$
(why is this so?) to determine $P(W)$.

c. Explain why the answers are the same.


3.18  Suppose  A  and  B  are  events  with  0  <  P(A)  <  1  and  0  <  P(B)  <  1. 

a.  If  A  and  B  are  disjoint,  can  they  be  independent? 

b.  If  A  and  B  are  independent,  can  they  be  disjoint? 

c. If $A \subset B$, can A and B be independent?

d. If A and B are independent, can A and $A \cup B$ be independent?


4 Discrete random variables


The  sample  space  associated  with  an  experiment,  together  with  a  probability 
function  defined  on  all  its  events,  is  a  complete  probabilistic  description  of 
that experiment. Often we are interested only in certain features of this
description. We focus on these features using random variables. In this chapter
we discuss discrete random variables, and in the next we will consider
continuous random variables. We introduce the Bernoulli, binomial, and geometric
random  variables. 


4.1  Random  variables 

Suppose  we  are  playing  the  board  game  “Snakes  and  Ladders,”  where  the 
moves  are  determined  by  the  sum  of  two  independent  throws  with  a  die.  An 
obvious  choice  of  the  sample  space  is 


$\Omega = \{(\omega_1, \omega_2) : \omega_1, \omega_2 \in \{1, 2, \ldots, 6\}\}$

$= \{(1,1), (1,2), \ldots, (1,6), (2,1), \ldots, (6,5), (6,6)\}.$

However, as players of the game, we are only interested in the sum of the
outcomes of the two throws, i.e., in the value of the function $S : \Omega \to \mathbb{R}$, given
by

$S(\omega_1, \omega_2) = \omega_1 + \omega_2$ for $(\omega_1, \omega_2) \in \Omega$.

In  Table  4.1  the  possible  results  of  the  first  throw  (top  margin),  those  of  the 
second  throw  (left  margin),  and  the  corresponding  values  of  S  (body)  are 
given.  Note  that  the  values  of  S  are  constant  on  lines  perpendicular  to  the 
diagonal.  We  denote  the  event  that  the  function  S  attains  the  value  k  by 
$\{S = k\}$, which is an abbreviation of “the subset of those $\omega = (\omega_1, \omega_2) \in \Omega$
for which $S(\omega_1, \omega_2) = \omega_1 + \omega_2 = k$,” i.e.,


$\{S = k\} = \{(\omega_1, \omega_2) \in \Omega : S(\omega_1, \omega_2) = k\}.$


Table 4.1. Two throws with a die and the corresponding sum.

                  $\omega_1$
$\omega_2$    1    2    3    4    5    6

    1         2    3    4    5    6    7
    2         3    4    5    6    7    8
    3         4    5    6    7    8    9
    4         5    6    7    8    9   10
    5         6    7    8    9   10   11
    6         7    8    9   10   11   12

Quick exercise 4.1 List the outcomes in the event $\{S = 8\}$.

We denote the probability of the event $\{S = k\}$ by

$P(S = k)$,

although formally we should write $P(\{S = k\})$ instead of $P(S = k)$. In our
example, S attains only the values $k = 2, 3, \ldots, 12$ with positive probability.
For example,

$P(S = 2) = P(\{(1,1)\}) = 1/36,$

$P(S = 3) = P(\{(1,2), (2,1)\}) = 2/36,$

while

$P(S = 13) = P(\emptyset) = 0,$
because 13 is an “impossible outcome.”
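
The events {S = k} can also be listed mechanically. The following is a minimal Python sketch, not part of the original text; the names `omega`, `event_sum`, and `prob_sum` are illustrative, and the code assumes that all 36 outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (w1, w2) of two die throws.
omega = list(product(range(1, 7), repeat=2))

def event_sum(k):
    """Return the event {S = k} as a list of outcomes (w1, w2)."""
    return [(w1, w2) for (w1, w2) in omega if w1 + w2 == k]

def prob_sum(k):
    """P(S = k), assuming all 36 outcomes are equally likely."""
    return Fraction(len(event_sum(k)), len(omega))

print(event_sum(3))                        # the outcomes that make up {S = 3}
print(prob_sum(2), prob_sum(3), prob_sum(13))
```

Note that `Fraction` reduces its value, so 2/36 is displayed as 1/18. The empty list returned by `event_sum(13)` corresponds to the empty event.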

Quick exercise 4.2 Use Table 4.1 to determine $P(S = k)$ for $k = 4, 5, \ldots, 12$.

Now suppose that for some other game the moves are given by the maximum
of two independent throws. In this case we are interested in the value of the
function $M : \Omega \to \mathbb{R}$, given by


$M(\omega_1, \omega_2) = \max\{\omega_1, \omega_2\}$ for $(\omega_1, \omega_2) \in \Omega$.

In  Table  4.2  the  possible  results  of  the  first  throw  (top  margin),  those  of  the 
second  throw  (left  margin),  and  the  corresponding  values  of  M  (body)  are 
given.  The  functions  S  and  M  are  examples  of  what  we  call  discrete  random 
variables. 


Definition. Let $\Omega$ be a sample space. A discrete random variable
is a function $X : \Omega \to \mathbb{R}$ that takes on a finite number of values
$a_1, a_2, \ldots, a_n$ or an infinite number of values $a_1, a_2, \ldots$.




Table 4.2. Two throws with a die and the corresponding maximum.

                  $\omega_1$
$\omega_2$    1    2    3    4    5    6

    1         1    2    3    4    5    6
    2         2    2    3    4    5    6
    3         3    3    3    4    5    6
    4         4    4    4    4    5    6
    5         5    5    5    5    5    6
    6         6    6    6    6    6    6

In  a  way,  a  discrete  random  variable  X  “transforms”  a  sample  space  to  a 
more  “tangible”  sample  space  whose  events  are  more  directly  related  to 
what you are interested in. For instance, S transforms
$\Omega = \{(1,1), (1,2), \ldots, (1,6), (2,1), \ldots, (6,5), (6,6)\}$ to $\tilde{\Omega} = \{2, \ldots, 12\}$, and M transforms $\Omega$ to
$\tilde{\Omega} = \{1, \ldots, 6\}$. Of course, there is a price to pay: one has to calculate the
probabilities  of  X.  Or,  to  say  things  more  formally,  one  has  to  determine 
the  probability  distribution  of  X,  i.e.,  to  describe  how  the  probability  mass  is 
distributed  over  possible  values  of  X. 
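
To make "how the probability mass is distributed" concrete, here is a small Python sketch that tallies the 36 outcomes by the value a random variable assigns to them. It is an illustration only: the helper `distribution` and the variable names are hypothetical, and the outcomes are assumed to be equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of two die throws; each outcome is assumed to have mass 1/36.
omega = list(product(range(1, 7), repeat=2))

def distribution(random_variable):
    """Map each attained value to its probability by counting outcomes."""
    counts = Counter(random_variable(w1, w2) for (w1, w2) in omega)
    return {value: Fraction(n, len(omega)) for value, n in sorted(counts.items())}

dist_S = distribution(lambda w1, w2: w1 + w2)  # sum of the two throws
dist_M = distribution(max)                     # maximum of the two throws

for value, prob in dist_M.items():
    print(value, prob)
```

The 36 outcomes collapse to 11 values for S and 6 values for M, and in both cases the probabilities still add up to 1.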


4.2 The probability distribution of a discrete random variable

Once a discrete random variable X is introduced, the sample space $\Omega$ is no
longer important. It suffices to list the possible values of X and their
corresponding probabilities. This information is contained in the probability mass
function of X.


Definition. The probability mass function p of a discrete random
variable X is the function $p : \mathbb{R} \to [0, 1]$, defined by

$p(a) = P(X = a)$ for $-\infty < a < \infty$.


If X is a discrete random variable that takes on the values $a_1, a_2, \ldots$, then
$p(a_i) > 0$, $p(a_1) + p(a_2) + \cdots = 1$, and $p(a) = 0$ for all other a.
As  an  example  we  give  the  probability  mass  function  p  of  M. 


a  1  2  3  4  5  6 

p(a)  1/36  3/36  5/36  7/36  9/36  11/36 


Of course, p(a) = 0 for all other a.


The  distribution  function  of  a  random  variable 

As  we  will  see,  so-called  continuous  random  variables  cannot  be  specified 
by  giving  a  probability  mass  function.  However,  the  distribution  function  of 
a  random  variable  X  (also  known  as  the  cumulative  distribution  function) 
allows  us  to  treat  discrete  and  continuous  random  variables  in  the  same  way. 


Definition. The distribution function F of a random variable X
is the function $F : \mathbb{R} \to [0, 1]$, defined by

$F(a) = P(X \le a)$ for $-\infty < a < \infty$.


Both the probability mass function and the distribution function of a discrete
random variable X contain all the probabilistic information of X; the probability
distribution of X is determined by either of them. In fact, the distribution
function F of a discrete random variable X can be expressed in terms of the
probability mass function p of X and vice versa. If X attains values $a_1, a_2, \ldots$,
such that

$p(a_i) > 0, \quad p(a_1) + p(a_2) + \cdots = 1,$

then

$F(a) = \sum_{a_i \le a} p(a_i).$

We see that, for a discrete random variable X, the distribution function F
jumps in each of the $a_i$, and is constant between successive $a_i$. The height of
the jump at $a_i$ is $p(a_i)$; in this way p can be retrieved from F. For example,
see Figure 4.1, where p and F are displayed for the random variable M.
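
The relation between p and F can be checked numerically for M. The sketch below is illustrative rather than taken from the text: it starts from the probability mass function of M tabulated earlier, written compactly as (2a - 1)/36, and the function names `F` and `jump` are assumptions made for this example.

```python
from fractions import Fraction

# Probability mass function of M: 1/36, 3/36, ..., 11/36 for a = 1, ..., 6.
p = {a: Fraction(2 * a - 1, 36) for a in range(1, 7)}

def F(a):
    """Distribution function: total mass of all values a_i with a_i <= a."""
    return sum((mass for value, mass in p.items() if value <= a), Fraction(0))

def jump(a):
    """Height of the jump of F at a, i.e. F(a) minus the mass strictly below a."""
    below = sum((mass for value, mass in p.items() if value < a), Fraction(0))
    return F(a) - below

print(F(0.5), F(3), F(3.7), F(6))   # F is constant between successive values
print(jump(4), p[4])                # the jump at 4 recovers p(4)
```

Evaluating `F` at non-integer points such as 3.7 shows the step shape: the value stays at F(3) until the next possible value of M is reached.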


Fig. 4.1. Probability mass function and distribution function of M.


We end this section with three properties of the distribution function F of a
random variable X:

1. For $a \le b$ one has that $F(a) \le F(b)$. This property is an immediate
consequence of the fact that $a \le b$ implies that the event $\{X \le a\}$ is
contained in the event $\{X \le b\}$.

2. Since $F(a)$ is a probability, the value of the distribution function is always
between 0 and 1. Moreover,

$\lim_{a \to +\infty} F(a) = \lim_{a \to +\infty} P(X \le a) = 1$

$\lim_{a \to -\infty} F(a) = \lim_{a \to -\infty} P(X \le a) = 0.$

3. F is right-continuous, i.e., one has

$\lim_{\varepsilon \downarrow 0} F(a + \varepsilon) = F(a).$

This  is  indicated  in  Figure  4.1  by  bullets.  Henceforth  we  will  omit  these 
bullets. 

Conversely,  any  function  F  satisfying  1,  2,  and  3  is  the  distribution  function 
of  some  random  variable  (see  Remarks  6.1  and  6.2). 

Quick exercise 4.3 Let X be a discrete random variable, and let a be such
that $p(a) > 0$. Show that $F(a) = P(X < a) + p(a)$.

There  are  many  discrete  random  variables  that  arise  in  a  natural  way.  We 
introduce  three  of  them  in  the  next  two  sections. 


4.3  The  Bernoulli  and  binomial  distributions 

The Bernoulli distribution is used to model an experiment with only two
possible outcomes, often referred to as “success” and “failure”, usually encoded
as 1 and 0.


Definition. A discrete random variable X has a Bernoulli distribution
with parameter p, where 0 < p < 1, if its probability mass
function is given by

$p_X(1) = P(X = 1) = p$ and $p_X(0) = P(X = 0) = 1 - p$.

We denote this distribution by Ber(p).


Note that we wrote $p_X$ instead of p for the probability mass function of X. This
was done to emphasize its dependence on X and to avoid possible confusion
with the parameter p of the Bernoulli distribution.


Consider  the  (fictitious)  situation  that  you  attend,  completely  unprepared, 
a  multiple-choice  exam.  It  consists  of  10  questions,  and  each  question  has 
four  alternatives  (of  which  only  one  is  correct).  You  will  pass  the  exam  if 
you  answer  six  or  more  questions  correctly.  You  decide  to  answer  each  of  the 
questions  in  a  random  way,  in  such  a  way  that  the  answer  of  one  question  is 
not  affected  by  the  answers  of  the  others.  What  is  the  probability  that  you 
will  pass? 

Setting for $i = 1, 2, \ldots, 10$

$R_i = \begin{cases} 1 & \text{if the $i$th answer is correct} \\ 0 & \text{if the $i$th answer is incorrect,} \end{cases}$

the number of correct answers X is given by

$X = R_1 + R_2 + R_3 + R_4 + R_5 + R_6 + R_7 + R_8 + R_9 + R_{10}.$

Quick  exercise  4.4  Calculate  the  probability  that  you  answered  the  first 
question  correctly  and  the  second  one  incorrectly. 

Clearly, X attains only the values 0, 1, ..., 10. Let us first consider the case
X = 0. Since the answers to the different questions do not influence each other,
we conclude that the events $\{R_1 = a_1\}, \ldots, \{R_{10} = a_{10}\}$ are independent for
every choice of the $a_i$, where each $a_i$ is 0 or 1. We find

$P(X = 0) = P(\text{not a single } R_i \text{ equals } 1)$

$= P(R_1 = 0, R_2 = 0, \ldots, R_{10} = 0)$

$= P(R_1 = 0)\,P(R_2 = 0) \cdots P(R_{10} = 0)$

$= \left(\frac{3}{4}\right)^{10}.$

The probability that we have answered exactly one question correctly equals

$P(X = 1) = \frac{1}{4} \cdot \left(\frac{3}{4}\right)^{9} \cdot 10,$

which is the probability that the answer is correct times the probability that
the other nine answers are wrong, times the number of ways in which this can
occur:

$P(X = 1) = P(R_1 = 1)\,P(R_2 = 0)\,P(R_3 = 0) \cdots P(R_{10} = 0)$

$+ P(R_1 = 0)\,P(R_2 = 1)\,P(R_3 = 0) \cdots P(R_{10} = 0)$

$+ \cdots$

$+ P(R_1 = 0)\,P(R_2 = 0)\,P(R_3 = 0) \cdots P(R_{10} = 1).$
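
The same bookkeeping can be carried out by brute force. The following Python sketch, added for illustration, enumerates all 2^10 patterns of correct (1) and incorrect (0) answers, multiplies the factors 1/4 and 3/4 as independence allows, and groups the patterns by the number of correct answers. The names `pattern_probability` and `pmf` are hypothetical.

```python
from fractions import Fraction
from itertools import product

p_correct = Fraction(1, 4)   # probability of guessing one question correctly

def pattern_probability(pattern):
    """Probability of one specific 0/1 answer pattern, using independence."""
    prob = Fraction(1)
    for r in pattern:
        prob *= p_correct if r == 1 else 1 - p_correct
    return prob

# Accumulate P(X = k) by summing over all patterns with exactly k ones.
pmf = {k: Fraction(0) for k in range(11)}
for pattern in product((0, 1), repeat=10):
    pmf[sum(pattern)] += pattern_probability(pattern)

print(pmf[0] == Fraction(3, 4) ** 10)
print(pmf[1] == 10 * Fraction(1, 4) * Fraction(3, 4) ** 9)
```

Every pattern with the same number of ones has the same probability, so each P(X = k) is that common probability times the number of such patterns.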

In general we find for $k = 0, 1, \ldots, 10$, again using independence, that

$P(X = k) = \left(\frac{1}{4}\right)^{k} \left(\frac{3}{4}\right)^{10-k} \cdot C_{10,k},$

which is the probability that k questions were answered correctly times the
probability that the other $10 - k$ answers are wrong, times the number of ways
$C_{10,k}$ this can occur.

So $C_{10,k}$ is the number of different ways in which one can choose k correct
answers from the list of 10. We already have seen that $C_{10,0} = 1$, because
there is only one way to do everything wrong; and that $C_{10,1} = 10$, because
each of the 10 questions may have been answered correctly.

More  generally,  if  we  have  to  choose  k  different  objects  out  of  an  ordered  list 
of  n  objects,  and  the  order  in  which  we  pick  the  objects  matters,  then  for 
the  first  object  you  have  n  possibilities,  and  no  matter  which  object  you  pick, 
for the second one there are $n - 1$ possibilities. For the third there are $n - 2$
possibilities, and so on, with $n - (k - 1)$ possibilities for the kth. So there are


$n(n - 1) \cdots (n - (k - 1))$


ways  to  choose  the  k  objects. 

In  how  many  ways  can  we  choose  three  questions?  When  the  order  matters, 
there  are  10  •  9  •  8  ways.  However,  the  order  in  which  these  three  questions  are 
selected  does  not  matter:  to  answer  questions  2,  5,  and  8  correctly  is  the  same 
as  answering  questions  8,  2,  and  5  correctly,  and  so  on.  The  triplet  {2,  5,  8} 
can  be  chosen  in  3  •  2  •  1  different  orders,  all  with  the  same  result.  There  are 
six  permutations  of  the  numbers  2,  5,  and  8  (see  page  14). 

Thus, compensating for this six-fold overcount, the number $C_{10,3}$ of ways to
correctly answer 3 questions out of 10 becomes

$C_{10,3} = \frac{10 \cdot 9 \cdot 8}{3 \cdot 2 \cdot 1}.$

More generally, for $n \ge 1$ and $1 \le k \le n$,

$C_{n,k} = \frac{n(n - 1) \cdots (n - (k - 1))}{k(k - 1) \cdots 2 \cdot 1}.$

Note that this is equal to

$\frac{n!}{k!\,(n - k)!},$

which is usually denoted by $\binom{n}{k}$, so $C_{n,k} = \binom{n}{k}$. Moreover, in accordance with
$0! = 1$ (as defined in Chapter 2), we put $C_{n,0} = \binom{n}{0} = 1$.
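
The counting argument can be verified directly in Python. This is an illustrative sketch: `ordered_choices` and `c` are names chosen here, while `math.comb`, `math.factorial`, and the `itertools` functions are standard library calls (`math.comb` requires Python 3.8 or later).

```python
from itertools import combinations, permutations
from math import comb, factorial

def ordered_choices(n, k):
    """n (n - 1) ... (n - (k - 1)): ways to pick k objects when order matters."""
    result = 1
    for j in range(k):
        result *= n - j
    return result

def c(n, k):
    """C_{n,k}: ordered choices divided by the k! orders of each selection."""
    return ordered_choices(n, k) // factorial(k)

# Three questions out of ten, counted by formula and by explicit listing.
print(ordered_choices(10, 3), len(list(permutations(range(10), 3))))
print(c(10, 3), len(list(combinations(range(10), 3))), comb(10, 3))
```

For k = 0 the product in `ordered_choices` is empty and equals 1, which matches the convention $C_{n,0} = 1$.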

Quick exercise 4.5 Show that $\binom{n}{n-k} = \binom{n}{k}$.


Substituting $\binom{10}{k}$ for $C_{10,k}$ we obtain

$P(X = k) = \binom{10}{k} \left(\frac{1}{4}\right)^{k} \left(\frac{3}{4}\right)^{10-k}.$


Since $P(X \ge 6) = P(X = 6) + \cdots + P(X = 10)$, it is now an easy (but
tedious) exercise to determine the probability that you will pass. One finds that
$P(X \ge 6) = 0.0197$. It pays to study, doesn’t it?!
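
The tedious part is easily delegated to a computer. The sketch below is an illustration, with `binomial_pmf` and `p_pass` as names chosen for this example. It sums $\binom{10}{k} (1/4)^k (3/4)^{10-k}$ for k = 6, ..., 10 in exact rational arithmetic.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of at least six correct guesses out of ten, with p = 1/4.
p_pass = sum(binomial_pmf(k, 10, Fraction(1, 4)) for k in range(6, 11))

print(p_pass)                    # exact value as a fraction
print(round(float(p_pass), 4))   # rounded to four decimals
```

Using `Fraction` avoids rounding until the very end; passing a float such as 0.25 to `binomial_pmf` works as well.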

The  preceding  random  variable  X  is  an  example  of  a  random  variable  with  a 
binomial  distribution  with  parameters  n  =  10  and  p  =  1/4. 


Definition. A discrete random variable X has a binomial distribution
with parameters n and p, where n = 1, 2, ... and 0 < p < 1,
if its probability mass function is given by

$p_X(k) = P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$ for $k = 0, 1, \ldots, n$.

We denote this distribution by Bin(n, p).


Figure 4.2 shows the probability mass function $p_X$ and distribution function
$F_X$ of a Bin(10, 1/4) distributed random variable.

Fig. 4.2. Probability mass function and distribution function of the Bin(10, 1/4)
distribution.


4.4  The  geometric  distribution 

In 1986, Weinberg and Gladen [38] investigated the number of menstrual cycles
it took women to become pregnant, measured from the moment they had
decided to become pregnant. We model the number of cycles up to pregnancy
by a random variable X.

Assume that the probability that a woman becomes pregnant during a particular
cycle is equal to p, for some p with 0 < p < 1, independent of the previous
cycles. Then clearly $P(X = 1) = p$. Due to the independence of consecutive
cycles, one finds for $k = 1, 2, \ldots$ that

$P(X = k) = P(\text{no pregnancy in the first } k - 1 \text{ cycles, pregnancy in the } k\text{th})$

$= (1 - p)^{k-1} p.$

This  random  variable  X  is  an  example  of  a  random  variable  with  a  geometric 
distribution  with  parameter  p. 


Definition. A discrete random variable X has a geometric distribution
with parameter p, where 0 < p < 1, if its probability mass
function is given by

$p_X(k) = P(X = k) = (1 - p)^{k-1} p$ for $k = 1, 2, \ldots$.

We denote this distribution by Geo(p).


Figure 4.3 shows the probability mass function $p_X$ and distribution function
$F_X$ of a Geo(1/4) distributed random variable.

Fig. 4.3. Probability mass function and distribution function of the Geo(1/4)
distribution.


Quick exercise 4.6 Let X have a Geo(p) distribution. For $n \ge 0$, show that
$P(X > n) = (1 - p)^n$.

The geometric distribution has a remarkable property, which is known as the
memoryless property.¹ For $n, k = 0, 1, 2, \ldots$ one has

$P(X > n + k \mid X > k) = P(X > n).$


We can derive this equality using the result from Quick exercise 4.6:

$P(X > n + k \mid X > k) = \frac{P(\{X > k + n\} \cap \{X > k\})}{P(X > k)}$

$= \frac{P(X > k + n)}{P(X > k)} = \frac{(1 - p)^{n+k}}{(1 - p)^{k}}$

$= (1 - p)^n = P(X > n).$
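
A numerical check of the memoryless property, as a minimal sketch: the names `geometric_pmf` and `tail` and the choice p = 1/4 are assumptions for illustration. The tail probability is computed by summing the probability mass function, without using the closed form $(1 - p)^n$, so the comparison is a genuine check.

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """P(X = k) = (1 - p)^(k - 1) p for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

def tail(n, p):
    """P(X > n), computed as 1 - P(X <= n) by summing the pmf."""
    return 1 - sum((geometric_pmf(k, p) for k in range(1, n + 1)), Fraction(0))

p = Fraction(1, 4)   # illustrative parameter value
for n in range(5):
    for k in range(5):
        conditional = tail(n + k, p) / tail(k, p)   # P(X > n + k | X > k)
        assert conditional == tail(n, p)

print(tail(3, p), (1 - p) ** 3)
```

The loop raises no assertion error, in line with the derivation: having waited k trials without success does not change the distribution of the remaining waiting time.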


4.5  Solutions  to  the  quick  exercises 

4.1  From  Table  4.1,  one  finds  that 

$\{S = 8\} = \{(2,6), (3,5), (4,4), (5,3), (6,2)\}.$

4.2 From Table 4.1, one determines the following table.

k           4     5     6     7     8     9     10    11    12

P(S = k)    3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

4.3 Since $\{X \le a\} = \{X < a\} \cup \{X = a\}$, it follows that

$F(a) = P(X \le a) = P(X < a) + P(X = a) = P(X < a) + p(a).$

Not very interestingly: this also holds if $p(a) = 0$.

4.4 The probability that you answered the first question correctly and the
second one incorrectly is given by $P(R_1 = 1, R_2 = 0)$. Due to independence,
this is equal to $P(R_1 = 1)\,P(R_2 = 0) = \frac{1}{4} \cdot \frac{3}{4} = \frac{3}{16}$.

4.5 Rewriting yields

$\binom{n}{n-k} = \frac{n!}{(n - k)!\,(n - (n - k))!} = \frac{n!}{k!\,(n - k)!} = \binom{n}{k}.$

¹ In fact, the geometric distribution is the only discrete random variable with this
property.

4.6 There are two ways to show that $P(X > n) = (1 - p)^n$. The easiest way is
to realize that $P(X > n)$ is the probability that we had “no success in the first
n trials,” which clearly equals $(1 - p)^n$. A more involved way is by calculation:

$P(X > n) = P(X = n + 1) + P(X = n + 2) + \cdots$

$= (1 - p)^n p + (1 - p)^{n+1} p + \cdots$

$= (1 - p)^n p \left(1 + (1 - p) + (1 - p)^2 + \cdots\right).$

If we recall from calculus that

$1 + (1 - p) + (1 - p)^2 + \cdots = \frac{1}{1 - (1 - p)} = \frac{1}{p},$

the answer follows immediately.


4.6  Exercises 

4.1 ⊞ Let Z represent the number of times a 6 appeared in two independent
throws of a die, and let S and M be as in Section 4.1.

a. Describe the probability distribution of Z, by giving either the probability
mass function $p_Z$ of Z or the distribution function $F_Z$ of Z. What type of
distribution does Z have, and what are the values of its parameters?

b. List the outcomes in the events $\{M = 2, Z = 0\}$, $\{S = 5, Z = 1\}$, and
$\{S = 8, Z = 1\}$. What are their probabilities?

c.  Determine  whether  the  events  {M  =  2}  and  {Z  =  0}  are  independent. 

4.2 Let X be a discrete random variable with probability mass function p
given by:


a  −1  0  1  2

P(«)  3  i  i  ^ 

and  p(a)  =  0  for  all  other  a. 

a. Let the random variable Y be defined by $Y = X^2$, i.e., if X = 2, then
Y = 4. Calculate the probability mass function of Y.

b. Calculate the value of the distribution functions of X and Y in a = 1,
a = 3/4, and $a = \pi - 3$.

4.3 □ Suppose that the distribution function of a discrete random variable X
is given by
(0


F{a)  = 


for  a  <  0 
for  0  <  a  <  ^ 
for  i  <  a  <  I 
for  a  >  |. 


Determine  the  probability  mass  function  of  X . 


4.4  You  toss  n  coins,  each  showing  heads  with  probability  p,  independently 
of  the  other  tosses.  Each  coin  that  shows  tails  is  tossed  again.  Let  X  be  the 
total  number  of  heads. 


a.  What  type  of  distribution  does  X  have?  Specify  its  parameter(s). 

b. What is the probability mass function of the total number of heads X?

4.5 A fair die is thrown until the sum of the results of the throws exceeds 6.
The random variable X is the number of throws needed for this. Let F be the
distribution function of X. Determine F(1), F(2), and F(7).

4.6 □ Three times we randomly draw a number from the following numbers:

1 2 3.


If $X_i$ represents the ith draw, i = 1, 2, 3, then the probability mass function
of $X_i$ is given by


a               1     2     3

$P(X_i = a)$    1/3   1/3   1/3

and $P(X_i = a) = 0$ for all other a. We assume that each draw is independent
of the previous draws. Let $\bar{X}$ be the average of $X_1$, $X_2$, and $X_3$, i.e.,

$\bar{X} = \frac{X_1 + X_2 + X_3}{3}.$

a. Determine the probability mass function $p_{\bar{X}}$ of $\bar{X}$.

b.  Compute  the  probability  that  exactly  two  draws  are  equal  to  1. 

4.7  □  A  shop  receives  a  batch  of  1000  cheap  lamps.  The  odds  that  a  lamp  is 
defective  are  0.1%.  Let  X  be  the  number  of  defective  lamps  in  the  batch. 

a. What kind of distribution does X have? What is/are the value(s) of the
parameter(s) of this distribution?

b.  What  is  the  probability  that  the  batch  contains  no  defective  lamps?  One 
defective  lamp?  More  than  two  defective  ones? 

4.8 □ In Section 1.4 we saw that each space shuttle has six O-rings and that
each O-ring fails with probability

$p(t) = \frac{e^{a + b \cdot t}}{1 + e^{a + b \cdot t}},$

where a = 5.085, b = −0.1156, and t is the temperature (in degrees Fahrenheit)
at the time of the launch of the space shuttle. At the time of the fatal
launch of the Challenger, t = 31, yielding p(31) = 0.8178.

a. Let X be the number of failing O-rings at launch temperature 31°F. What
type of probability distribution does X have, and what are the values of
its parameters?

b. What is the probability $P(X \ge 1)$ that at least one O-ring fails?

4.9 For simplicity’s sake, let us assume that all space shuttles will be launched
at 81°F (which is the highest recorded launch temperature in Figure 1.3). With
this temperature, the probability of an O-ring failure is equal to p(81) = 0.0137
(see Section 1.4 or Exercise 4.8).

a. What is the probability that during 23 launches no O-ring will fail, but
that at least one O-ring will fail during the 24th launch of a space shuttle?

b. What is the probability that no O-ring fails during 24 launches?


4.10 ⊞ Early in the morning, a group of m people decides to use the elevator
in an otherwise deserted building of 21 floors. Each of these persons chooses
his or her floor independently of the others, and, from our point of view,
completely at random, so that each person selects a floor with probability
1/21. Let $S_m$ be the number of times the elevator stops. In order to study
$S_m$, we introduce for i = 1, 2, ..., 21 random variables $R_i$, given by

$R_i = \begin{cases} 1 & \text{if the elevator stops at the $i$th floor} \\ 0 & \text{if the elevator does not stop at the $i$th floor.} \end{cases}$

a. Each $R_i$ has a Ber(p) distribution. Show that $p = 1 - (20/21)^m$.

b. From the way we defined $S_m$, it follows that

$S_m = R_1 + R_2 + \cdots + R_{21}.$

Can we conclude that $S_m$ has a Bin(21, p) distribution, with p as in part a?
Why or why not?

c. Clearly, if m = 1, one has that $P(S_1 = 1) = 1$. Show that for m = 2

$P(S_2 = 1) = \frac{1}{21} = 1 - P(S_2 = 2),$

and that $S_3$ has the following distribution.

a               1       2        3

$P(S_3 = a)$    1/441   60/441   380/441

4.11 You decide to play monthly in two different lotteries, and you stop playing
as soon as you win a prize in one (or both) lotteries of at least one million
euros. Suppose that every time you participate in these lotteries, the probability
to win one million (or more) euros is $p_1$ for one of the lotteries and $p_2$
for  the  other.  Let  M  be  the  number  of  times  you  participate  in  these  lotteries 
until  winning  at  least  one  prize.  What  kind  of  distribution  does  M  have,  and 
what  is  its  parameter? 

4.12  □  You  and  a  friend  want  to  go  to  a  concert,  but  unfortunately  only  one 
ticket  is  still  available.  The  man  who  sells  the  tickets  decides  to  toss  a  coin 
until  heads  appears.  In  each  toss  heads  appears  with  probability  p,  where 
0  <  p  <  1,  independent  of  each  of  the  previous  tosses.  If  the  number  of  tosses 
needed  is  odd,  your  friend  is  allowed  to  buy  the  ticket;  otherwise  you  can  buy 
it.  Would  you  agree  to  this  arrangement? 

4.13 ⊞ A box contains an unknown number N of identical bolts. In order
to  get  an  idea  of  the  size  N,  we  randomly  mark  one  of  the  bolts  from  the 
box.  Next  we  select  at  random  a  bolt  from  the  box.  If  this  is  the  marked  bolt 
we  stop,  otherwise  we  return  the  bolt  to  the  box,  and  we  randomly  select  a 
second  one,  etc.  We  stop  when  the  selected  bolt  is  the  marked  one.  Let  X  be 
the  number  of  times  a  bolt  was  selected.  Later  (in  Exercise  21.11)  we  will  try 
to  find  an  estimate  of  N.  Here  we  look  at  the  probability  distribution  of  X. 

a. What is the probability distribution of X? Specify its parameter(s)!

b.  The  drawback  of  this  approach  is  that  X  can  attain  any  of  the  values 
1, 2, 3, ... ,  so  that  if  N  is  large  we  might  be  sampling  from  the  box  for 
quite  a  long  time.  We  decide  to  sample  from  the  box  in  a  slightly  different 
way:  after  we  have  randomly  marked  one  of  the  bolts  in  the  box,  we 
select  at  random  a  bolt  from  the  box.  If  this  is  the  marked  one,  we  stop, 
otherwise  we  randomly  select  a  second  bolt  (we  do  not  return  the  selected 
bolt).  We  stop  when  we  select  the  marked  bolt.  Let  Y  be  the  number  of 
times  a  bolt  was  selected. 

Show that $P(Y = k) = 1/N$ for $k = 1, 2, \ldots, N$ (Y has a so-called discrete
uniform distribution).

c. Instead of randomly marking one bolt in the box, we mark m bolts, with
m smaller than N. Next, we randomly select r bolts; Z is the number of
marked bolts in the sample.

Show that

$P(Z = k) = \frac{\binom{m}{k} \binom{N-m}{r-k}}{\binom{N}{r}}$ for $k = 0, 1, 2, \ldots, r$.

(Z has a so-called hypergeometric distribution, with parameters m, N,
and r.)

4.14  We  throw  a  coin  until  a  head  turns  up  for  the  second  time,  where  p  is  the 
probability that a throw results in a head and we assume that the outcome
of each throw is independent of the previous outcomes. Let X be the number
of times we have thrown the coin.

a. Determine P(X = 2), P(X = 3), and P(X = 4).

b. Show that $P(X = n) = (n - 1)\,p^2 (1 - p)^{n-2}$ for $n \ge 2$.


5 Continuous random variables


Many  experiments  have  outcomes  that  take  values  on  a  continuous  scale.  For 
example,  in  Chapter  2  we  encountered  the  load  at  which  a  model  of  a  bridge 
collapses. These experiments have continuous random variables naturally
associated with them.


5.1  Probability  density  functions 

One way to look at continuous random variables is that they arise by a
(never-ending) process of refinement from discrete random variables. Suppose, for
example,  that  a  discrete  random  variable  associated  with  some  experiment 
takes  on  the  value  6.283  with  probability  p.  If  we  refine,  in  the  sense  that  we 
also  get  to  know  the  fourth  decimal,  then  the  probability  p  is  spread  over  the 
outcomes  6.2830, 6.2831, . . . ,  6.2839.  Usually  this  will  mean  that  each  of  these 
new  values  is  taken  on  with  a  probability  that  is  much  smaller  than  p — the 
sum  of  the  ten  probabilities  is  p.  Continuing  the  refinement  process  to  more 
and  more  decimals,  the  probabilities  of  the  possible  values  of  the  outcomes 
become  smaller  and  smaller,  approaching  zero.  However,  the  probability  that 
the  possible  values  lie  in  some  fixed  interval  [a,  b]  will  settle  down.  This  is 
closely  related  to  the  way  sums  converge  to  an  integral  in  the  definition  of  the 
integral  and  motivates  the  following  definition. 


Definition. A random variable X is continuous if for some function
$f : \mathbb{R} \to \mathbb{R}$ and for any numbers a and b with $a \le b$,

$P(a \le X \le b) = \int_a^b f(x)\,dx.$

The function f has to satisfy $f(x) \ge 0$ for all x and $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
We call f the probability density function (or probability density)
of X.

Fig. 5.1. Area under a probability density function f on the interval [a, b].


Note that the probability that $X$ lies in an interval $[a, b]$ is equal to the area
under the probability density function $f$ of $X$ over the interval $[a, b]$; this
is illustrated in Figure 5.1. So if the interval gets smaller and smaller, the
probability will go to zero: for any positive $\varepsilon$

\[ P(a - \varepsilon \leq X \leq a + \varepsilon) = \int_{a-\varepsilon}^{a+\varepsilon} f(x)\,dx, \]

and sending $\varepsilon$ to 0, it follows that for any $a$

\[ P(X = a) = 0. \]

This implies that for continuous random variables you may be careless about
the precise form of the intervals:

\[ P(a \leq X \leq b) = P(a < X \leq b) = P(a \leq X < b) = P(a < X < b). \]

What does $f(a)$ represent? Note (see also Figure 5.2) that

\[ P(a - \varepsilon \leq X \leq a + \varepsilon) = \int_{a-\varepsilon}^{a+\varepsilon} f(x)\,dx \approx 2\varepsilon f(a) \tag{5.1} \]

for small positive $\varepsilon$. Hence $f(a)$ can be interpreted as a (relative) measure of
how likely it is that $X$ will be near $a$. However, do not think of $f(a)$ as a
probability: $f(a)$ can be arbitrarily large. An example of such an $f$ is given in
the following exercise.

Quick exercise 5.1 Let the function $f$ be defined by $f(x) = 0$ if $x \leq 0$
or $x \geq 1$, and $f(x) = 1/(2\sqrt{x})$ for $0 < x < 1$. You can check quickly that
$f$ satisfies the two properties of a probability density function. Let $X$ be
a random variable with $f$ as its probability density function. Compute the
probability that $X$ lies between $10^{-4}$ and $10^{-2}$.
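
The two points made above, that $f(a)$ is not itself a probability and that $2\varepsilon f(a)$ approximates the probability of a small interval around $a$, can be checked numerically. The following Python sketch does this for the density of Quick exercise 5.1. The function names, the point $a = 0.25$, and the value of `eps` are our own illustrative choices; the distribution function `F` is the one obtained by integrating $f$.

```python
import math

def f(x):
    """Density of Quick exercise 5.1: 1/(2*sqrt(x)) on (0, 1), zero elsewhere."""
    return 1.0 / (2.0 * math.sqrt(x)) if 0.0 < x < 1.0 else 0.0

def F(x):
    """Distribution function obtained by integrating f: sqrt(x) on [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return math.sqrt(x)

def interval_probability(a, eps):
    """P(a - eps <= X <= a + eps), computed exactly from F."""
    return F(a + eps) - F(a - eps)

a, eps = 0.25, 1e-4                  # illustrative values, not from the text
exact = interval_probability(a, eps)
approx = 2 * eps * f(a)              # right-hand side of equation (5.1)
print("exact :", exact)
print("approx:", approx)

# Close to 0 the density exceeds 1, so it cannot be a probability.
print("f(1e-6) =", f(1e-6))
```

The approximation improves as `eps` shrinks, while the density value near 0 grows without bound.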


You should realize that discrete random variables do not have a probability
density function $f$ and continuous random variables do not have a probability
mass function $p$, but that both have a distribution function $F(a) = P(X \leq a)$.
Using the fact that for $a < b$ the event $\{X \leq b\}$ is a disjoint union of the
events $\{X \leq a\}$ and $\{a < X \leq b\}$, we can express the probability that $X$ lies
in an interval $(a, b]$ directly in terms of $F$ for both cases:

\[ P(a < X \leq b) = P(X \leq b) - P(X \leq a) = F(b) - F(a). \]

There is a simple relation between the distribution function $F$ and the
probability density function $f$ of a continuous random variable. It follows from
integral calculus that

\[ F(b) = \int_{-\infty}^{b} f(x)\,dx \quad \text{and}^{1} \quad f(x) = \frac{d}{dx}F(x). \]

Both the probability density function and the distribution function of a
continuous random variable $X$ contain all the probabilistic information about $X$;
the probability distribution of $X$ is described by either of them.

We illustrate all this with an example. Suppose we want to make a probability
model for an experiment that can be described as "an object hits a disc of
radius $r$ in a completely arbitrary way" (of course, this is not you playing
darts; nevertheless we will refer to this example as the darts example). We
are interested in the distance $X$ between the hitting point and the center of
the disc. Since distances cannot be negative, we have $F(b) = P(X \leq b) = 0$
when $b < 0$. Since the object hits the disc, we have $F(b) = 1$ when $b > r$. That
the dart hits the disc in a completely arbitrary way we interpret as that the
probability of hitting any region is proportional to the area of that region. In
particular, because the disc has area $\pi r^2$ and the disc with radius $b$ has area
$\pi b^2$, we should put

\[ F(b) = P(X \leq b) = \frac{\pi b^2}{\pi r^2} = \frac{b^2}{r^2} \quad \text{for } 0 \leq b \leq r. \]

$^{1}$ This holds for all $x$ where $f$ is continuous.

Then the probability density function $f$ of $X$ is equal to 0 outside the interval
$[0, r]$ and

\[ f(x) = \frac{d}{dx}F(x) = \frac{1}{r^2}\,\frac{d}{dx}x^2 = \frac{2x}{r^2} \quad \text{for } 0 \leq x \leq r. \]
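
As a rough check of the darts model, the following Python sketch draws points uniformly from a disc and compares the observed fraction of distances not exceeding $b$ with $b^2/r^2$. The sampling method (rejection from the enclosing square), the radius, the seed, and all function names are assumptions made for illustration; they are not part of the text.

```python
import math
import random

def sample_distance(r, rng):
    """Distance to the center of a point chosen uniformly on a disc of radius r.

    Points are drawn uniformly from the enclosing square and rejected
    until one falls inside the disc.
    """
    while True:
        x = rng.uniform(-r, r)
        y = rng.uniform(-r, r)
        if x * x + y * y <= r * r:
            return math.hypot(x, y)

def empirical_cdf(samples, b):
    """Fraction of the samples that are at most b."""
    return sum(1 for s in samples if s <= b) / len(samples)

rng = random.Random(0)   # fixed seed so the run is reproducible
r = 2.0                  # illustrative radius
samples = [sample_distance(r, rng) for _ in range(100_000)]

for b in (0.5, 1.0, 1.5):
    print(b, empirical_cdf(samples, b), (b / r) ** 2)
```

The two printed columns should agree up to simulation noise, whatever radius is chosen.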

Quick  exercise  5.2  Compute  for  the  darts  example  the  probability  that 
0  <  X  <  r/2,  and  the  probability  that  r/2  <  X  <  r. 


5.2  The  uniform  distribution 

In this section we encounter a continuous random variable that describes an
experiment where the outcome is completely arbitrary, except that we know
that it lies between certain bounds. Many experiments of physical origin have
this kind of behavior. For instance, suppose we measure for a long time the
emission of radioactive particles of some material. Suppose that the experiment
consists of recording in each hour at what times the particles are emitted.
Then the outcomes will lie in the interval $[0, 60]$ minutes. If the measurements
were to concentrate in any way, then either there is something wrong with your
Geiger counter or you are about to discover some new physical law. Not
concentrating in any way means that subintervals of the same length should have
the same probability. It is then clear (cf. equation (5.1)) that the probability
density function associated with this experiment should be constant on $[0, 60]$.
This motivates the following definition.


Definition. A continuous random variable has a uniform distribution
on the interval $[\alpha, \beta]$ if its probability density function $f$ is given
by $f(x) = 0$ if $x$ is not in $[\alpha, \beta]$ and

\[ f(x) = \frac{1}{\beta - \alpha} \quad \text{for } \alpha \leq x \leq \beta. \]

We denote this distribution by $U(\alpha, \beta)$.

Quick exercise 5.3 Argue that the distribution function $F$ of a random
variable that has a $U(\alpha, \beta)$ distribution is given by $F(x) = 0$ if $x < \alpha$,
$F(x) = 1$ if $x > \beta$, and $F(x) = (x - \alpha)/(\beta - \alpha)$ for $\alpha \leq x \leq \beta$.

In Figure 5.3 the probability density function and the distribution function of
a $U(0, \tfrac{1}{3})$ distribution are depicted.

Fig. 5.3. The probability density function and the distribution function of the
$U(0, \tfrac{1}{3})$ distribution.

5.3  The  exponential  distribution 

We already encountered the exponential distribution in the chemical reactor
example of Chapter 3. We will give an argument why it appears in that example.
Let $v$ be the effluent volumetric flow rate, i.e., the volume that leaves
the reactor over a time interval $[0, t]$ is $vt$ (and an equal volume enters the
vessel at the other end). Let $V$ be the volume of the reactor vessel. Then in
total a fraction $(v/V) \cdot t$ will have left the vessel during $[0, t]$, when $t$ is not
too large. Let the random variable $T$ be the residence time of a particle in
the vessel. To compute the distribution of $T$, we divide the interval $[0, t]$ into
$n$ small intervals of equal length $t/n$. Assuming perfect mixing, so that the
particle's position is uniformly distributed over the volume, the particle has
probability $p = (v/V) \cdot t/n$ to have left the vessel during any of the $n$ intervals
of length $t/n$. If we assume that the behavior of the particle in different time
intervals of length $t/n$ is independent, we have, if we call "leaving the vessel"
a success, that $T$ has a geometric distribution with success probability $p$. It
follows (see also Quick exercise 4.6) that the probability $P(T > t)$ that the
particle is still in the vessel at time $t$ is, for large $n$, well approximated by

\[ (1 - p)^n = \left(1 - \frac{vt}{Vn}\right)^n. \]

But then, letting $n \to \infty$, we obtain (recall a well-known limit from your
calculus course)

\[ P(T > t) = \lim_{n \to \infty} \left(1 - \frac{vt}{V} \cdot \frac{1}{n}\right)^n = e^{-\frac{v}{V}t}. \]

It follows that the distribution function of $T$ equals $1 - e^{-\frac{v}{V}t}$, and
differentiating we obtain that the probability density function $f_T$ of $T$ is equal to

\[ f_T(t) = \frac{d}{dt}\left(1 - e^{-\frac{v}{V}t}\right) = \frac{v}{V}\,e^{-\frac{v}{V}t} \quad \text{for } t \geq 0. \]

This is an example of an exponential distribution, with parameter $v/V$.
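
The limiting step in this argument can be watched numerically. The Python sketch below compares the discrete survival probability $(1 - p)^n$, with $p = (v/V) \cdot t/n$, against $e^{-(v/V)t}$ for growing $n$. The variable `rate` stands for $v/V$; its value, the time `t`, and the function names are illustrative assumptions.

```python
import math

def survival_discrete(rate, t, n):
    """(1 - p)^n with p = rate * t / n: the particle stays in all n small intervals."""
    p = rate * t / n
    return (1.0 - p) ** n

def survival_exponential(rate, t):
    """Limiting value exp(-rate * t)."""
    return math.exp(-rate * t)

rate, t = 0.25, 4.0   # rate plays the role of v/V; both values are illustrative
for n in (10, 100, 1000, 10000):
    print(n, survival_discrete(rate, t, n), survival_exponential(rate, t))
```

The discrete values increase toward the exponential limit as the time grid is refined.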

Definition. A continuous random variable has an exponential distribution
with parameter $\lambda$ if its probability density function $f$ is
given by $f(x) = 0$ if $x < 0$ and

\[ f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0. \]

We denote this distribution by $Exp(\lambda)$.

The distribution function $F$ of an $Exp(\lambda)$ distribution is given by

\[ F(a) = 1 - e^{-\lambda a} \quad \text{for } a \geq 0. \]

In Figure 5.4 we show the probability density function and the distribution
function of the $Exp(0.25)$ distribution.

Fig. 5.4. The probability density and the distribution function of the $Exp(0.25)$
distribution.


Since we obtained the exponential distribution directly from the geometric
distribution it should not come as a surprise that the exponential distribution
also satisfies the memoryless property, i.e., if $X$ has an exponential
distribution, then for all $s, t > 0$,

\[ P(X > s + t \mid X > s) = P(X > t). \]

Actually, this follows directly from

\[ P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t). \]
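
The memoryless property can also be checked by computation. The following Python sketch does so in two ways: exactly, from the survival function $e^{-\lambda x}$, and approximately, from simulated values. The parameter values, the seed, and the function names are illustrative assumptions; `random.expovariate` is the standard-library generator for exponential variates.

```python
import math
import random

def survival(lam, x):
    """P(X > x) for X with an Exp(lam) distribution and x >= 0."""
    return math.exp(-lam * x)

def conditional_survival(lam, s, t):
    """P(X > s + t | X > s), from the definition of conditional probability."""
    return survival(lam, s + t) / survival(lam, s)

lam, s, t = 0.25, 3.0, 5.0   # illustrative values
print(conditional_survival(lam, s, t), survival(lam, t))

# Simulation: among the draws exceeding s, count how many also exceed s + t.
rng = random.Random(1)
draws = [rng.expovariate(lam) for _ in range(200_000)]
survivors = [x for x in draws if x > s]
simulated = sum(1 for x in survivors if x > s + t) / len(survivors)
print(simulated)
```

Both the exact ratio and the simulated fraction should be close to $P(X > t)$.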


Quick exercise 5.4 A study of the response time of a certain computer system
yields that the response time in seconds is exponentially distributed
with parameter 0.25. What is the probability that the response time
exceeds 5 seconds?


5.4  The  Pareto  distribution 

More than a century ago the economist Vilfredo Pareto ([20]) noticed that
the number of people whose income exceeded level $x$ was well approximated
by $C/x^{\alpha}$, for some constants $C$ and $\alpha > 0$ (it appears that for all countries
$\alpha$ is around 1.5). A similar phenomenon occurs with city sizes, earthquake
rupture areas, insurance claims, and sizes of commercial companies. When
these quantities are modeled as realizations of random variables $X$, then their
distribution functions are of the type $F(x) = 1 - 1/x^{\alpha}$ for $x \geq 1$. (Here
1 is a more or less arbitrarily chosen starting point; what matters is the
behavior for large $x$.) Differentiating, we obtain probability densities of the
form $f(x) = \alpha/x^{\alpha+1}$. This motivates the following definition.


Definition. A continuous random variable has a Pareto distribution
with parameter $\alpha > 0$ if its probability density function $f$ is given
by $f(x) = 0$ if $x < 1$ and

\[ f(x) = \frac{\alpha}{x^{\alpha+1}} \quad \text{for } x \geq 1. \]

We denote this distribution by $Par(\alpha)$.

In Figure 5.5 we depicted the probability density $f$ and the distribution
function $F$ of the $Par(0.5)$ distribution.

Fig. 5.5. The probability density and the distribution function of the $Par(0.5)$
distribution.


5.5  The  normal  distribution 

The normal distribution plays a central role in probability theory and
statistics. One of its first applications was due to C.F. Gauss, who used it in 1809
to model observational errors in astronomy; see [13]. We will see in
Chapter 14 that the normal distribution is an important tool to approximate the
probability distribution of the average of independent random variables.


Definition. A continuous random variable has a normal distribution
with parameters $\mu$ and $\sigma^2 > 0$ if its probability density function
$f$ is given by

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{for } -\infty < x < \infty. \]

We denote this distribution by $N(\mu, \sigma^2)$.


In Figure 5.6 the graphs of the probability density function $f$ and distribution
function $F$ of the normal distribution with $\mu = 3$ and $\sigma^2 = 6.25$ are displayed.


Fig. 5.6. The probability density and the distribution function of the $N(3, 6.25)$
distribution.


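If $X$ has an $N(\mu, \sigma^2)$ distribution, then its distribution function is given by

\[ F(a) = \int_{-\infty}^{a} \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\,dx \quad \text{for } -\infty < a < \infty. \]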


Unfortunately there is no explicit expression for $F$; $f$ has no antiderivative
in closed form. However, as we shall see in Chapter 8, any $N(\mu, \sigma^2)$ distributed
random variable can be turned into an $N(0, 1)$ distributed random variable by a simple
transformation. As a consequence, a table of the $N(0, 1)$ distribution suffices.
The latter is called the standard normal distribution, and because of its special
role the letter $\phi$ has been reserved for its probability density function:

\[ \phi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2} \quad \text{for } -\infty < x < \infty. \]

Note that $\phi$ is symmetric around zero: $\phi(-x) = \phi(x)$ for each $x$. The
corresponding distribution function is denoted by $\Phi$. The table for the standard
normal distribution (see Table B.1) does not contain the values of $\Phi(a)$, but rather
the so-called right tail probabilities $1 - \Phi(a)$. If, for instance, we want to know
the probability that a standard normal random variable $Z$ is smaller than or
equal to 1, we use that $P(Z \leq 1) = 1 - P(Z \geq 1)$. In the table we find that
$P(Z \geq 1) = 1 - \Phi(1)$ is equal to 0.1587. Hence $P(Z \leq 1) = 1 - 0.1587 = 0.8413$.

With the table you can handle tail probabilities with numbers $a$ given to two
decimals. To find, for instance, $P(Z \geq 1.07)$, we stay in the same row in the
table but move to the seventh column to find that $P(Z \geq 1.07) = 0.1423$.
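
Table look-ups of this kind can be cross-checked on a computer. The sketch below, in Python, evaluates the standard normal distribution function through the error function, using the identity $\Phi(a) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(a/\sqrt{2})\bigr)$; this identity and the function names `Phi` and `right_tail` are not taken from the text but are a standard way to compute these values.

```python
import math

def Phi(a):
    """Standard normal distribution function, computed via the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def right_tail(a):
    """Right tail probability 1 - Phi(a), the quantity tabulated in the text."""
    return 1.0 - Phi(a)

for a in (1.0, 1.07):
    print(a, round(right_tail(a), 4))

print(round(Phi(1.0), 4))
```

Rounded to four decimals, the printed tail probabilities can be compared with the table entries quoted above.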

Quick exercise 5.5 Let the random variable $Z$ have a standard normal
distribution. Use Table B.1 to find $P(Z \leq 0.75)$. How do you know, without
doing any calculations, that the answer should be larger than 0.5?


5.6  Quantiles 

Recall the chemical reactor example, where the residence time $T$, measured
in minutes, has an exponential distribution with parameter $\lambda = v/V = 0.25$.
As we shall see in the next chapters, a consequence of this choice of $\lambda$ is that
the mean time the particle stays in the vessel is 4 minutes. However, from the
viewpoint of process control this is not the quantity of interest. Often, there
will be some minimal amount of time the particle has to stay in the vessel to
participate in the chemical reaction, and we would want at least 90% of
the particles to stay in the vessel this minimal amount of time. In other words,
we are interested in the number $q$ with the property that $P(T > q) = 0.9$, or
equivalently,

\[ P(T \leq q) = 0.1. \]

The number $q$ is called the 0.1th quantile or 10th percentile of the distribution.
In the case at hand it is easy to determine. We should have

\[ P(T \leq q) = 1 - e^{-0.25q} = 0.1. \]

This holds exactly when $e^{-0.25q} = 0.9$ or when $-0.25q = \ln(0.9) = -0.105$.
So $q = 0.42$. Hence, although the mean residence time is 4 minutes, 10% of
the particles stay less than 0.42 minute in the vessel, which is just slightly
more than 25 seconds! We use the following general definition.


Definition. Let $X$ be a continuous random variable and let $p$ be a
number between 0 and 1. The $p$th quantile or $100p$th percentile of
the distribution of $X$ is the smallest number $q_p$ such that

\[ F(q_p) = P(X \leq q_p) = p. \]

The median of a distribution is its 50th percentile.

Quick exercise 5.6 What is the median of the $U(2, 7)$ distribution?

For continuous random variables $q_p$ is often easy to determine. Indeed, if $F$ is
strictly increasing from 0 to 1 on some interval (which may be infinite to one
or both sides), then

\[ q_p = F^{\mathrm{inv}}(p), \]

where $F^{\mathrm{inv}}$ is the inverse of $F$. This is illustrated in Figure 5.7 for the
$Exp(0.25)$ distribution.
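
For the exponential distribution the inverse can be written down by solving $1 - e^{-\lambda q} = p$ for $q$, which gives $q_p = -\ln(1 - p)/\lambda$. The Python sketch below applies this to the reactor example with $\lambda = 0.25$ and $p = 0.1$; the function names are our own.

```python
import math

def exp_cdf(lam, x):
    """Distribution function of Exp(lam): 1 - exp(-lam * x) for x >= 0, else 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def exp_quantile(lam, p):
    """Solve 1 - exp(-lam * q) = p for q, where 0 < p < 1."""
    return -math.log(1.0 - p) / lam

q10 = exp_quantile(0.25, 0.1)     # 10th percentile of Exp(0.25), in minutes
print(round(q10, 2))              # agrees with q = 0.42 found in the text
print(round(60 * q10, 1))         # the same quantile expressed in seconds
print(exp_cdf(0.25, q10))         # plugging q back into F returns p
```

Substituting the quantile back into $F$ is a quick way to confirm that the inversion was done correctly.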


For an exponential distribution it is easy to compute quantiles. This is
different for the standard normal distribution, where we have to use a table
(like Table B.1). For example, the 90th percentile of a standard normal is the
number $q_{0.9}$ such that $\Phi(q_{0.9}) = 0.9$, which is the same as $1 - \Phi(q_{0.9}) = 0.1$,
and the table gives us $q_{0.9} = 1.28$. This is illustrated in Figure 5.8, with both
the probability density function and the distribution function of the standard
normal distribution.

Quick exercise 5.7 Find the 0.95th quantile $q_{0.95}$ of a standard normal
distribution, accurate to two decimals.


5.7 Solutions to the quick exercises

5.1 We know from integral calculus that for $0 \leq a \leq b \leq 1$

\[ \int_a^b f(x)\,dx = \int_a^b \frac{1}{2\sqrt{x}}\,dx = \sqrt{b} - \sqrt{a}. \]

Hence $\int_{-\infty}^{\infty} f(x)\,dx = \int_0^1 1/(2\sqrt{x})\,dx = 1$ (so $f$ is a probability density
function, nonnegativity being obvious), and

\[ P(10^{-4} \leq X \leq 10^{-2}) = \int_{10^{-4}}^{10^{-2}} \frac{1}{2\sqrt{x}}\,dx = \sqrt{10^{-2}} - \sqrt{10^{-4}} = 10^{-1} - 10^{-2} = 0.09. \]

Actually, the random variable $X$ arises in a natural way; see equation (7.1).

5.2 We have $P(0 < X \leq r/2) = F(r/2) - F(0) = (1/2)^2 - 0^2 = 1/4$, and
$P(r/2 < X \leq r) = F(r) - F(r/2) = 1 - 1/4 = 3/4$, no matter what the radius
of the disc is!


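5.3 Since $f(x) = 0$ for $x < \alpha$, we have $F(x) = 0$ if $x < \alpha$. Also, since $f(x) = 0$
for all $x > \beta$, $F(x) = 1$ if $x > \beta$. In between

\[ F(x) = \int_{-\infty}^{x} f(y)\,dy = \int_{\alpha}^{x} \frac{1}{\beta - \alpha}\,dy = \left[\frac{y}{\beta - \alpha}\right]_{\alpha}^{x} = \frac{x - \alpha}{\beta - \alpha}. \]

In other words, the distribution function increases linearly from the value 0
in $\alpha$ to the value 1 in $\beta$.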


5.4 If $X$ is the response time, we ask for $P(X > 5)$. This equals

\[ P(X > 5) = e^{-0.25 \cdot 5} = e^{-1.25} = 0.2865\ldots. \]

5.5 In the eighth row and sixth column of the table, we find that $1 - \Phi(0.75) = 0.2266$.
Hence the answer is $1 - 0.2266 = 0.7734$. Because of the symmetry of
the probability density $\phi$, half of the mass of a standard normal distribution
lies on the negative axis. Hence for any number $a > 0$, it should be true that
$P(Z \leq a) > P(Z \leq 0) = 0.5$.

5.6 The median is the number $q_{0.5} = F^{\mathrm{inv}}(0.5)$. You either see directly that
you have got half of the mass to both sides of the middle of the interval, hence
$q_{0.5} = (2 + 7)/2 = 4.5$, or you solve with the distribution function:

\[ \tfrac{1}{2} = F(q) = \frac{q - 2}{7 - 2}, \quad \text{and so } q = 4.5. \]

5.7 Since $\Phi(q_{0.95}) = 0.95$ is the same as $1 - \Phi(q_{0.95}) = 0.05$, the table gives
us $q_{0.95} = 1.64$, or more precisely, if we interpolate between the fourth and
the fifth column, 1.645.


5.8  Exercises 


5.1 Let $X$ be a continuous random variable with probability density function


$f(x)$ =


for $0 \leq x \leq 1$
for $2 \leq x \leq 3$
elsewhere.


a. Draw the graph of $f$.

b. Determine the distribution function $F$ of $X$, and draw its graph.


5.2  □  Let  Y  be  a  random  variable  that  takes  values  in  [0, 1],  and  is  further 
given  by 

F{x)  =  for  0  <  a;  <  1. 

Compute  P(^  <  Y  <  |). 

5.3  Let  a  continuous  random  variable  Y  be  given  that  takes  values  in  [0, 1], 
and  whose  distribution  function  F  satisfies 

F{x)  =  2x^  —  x"^  for  0  <  a:  <  1. 

a.  Compute  P(|  <  Y  <  |). 

b.  What  is  the  probability  density  function  of  Y? 


5.4 ⊞ Jensen, arriving at a bus stop, just misses the bus. Suppose that he
decides to walk if the (next) bus takes longer than 5 minutes to arrive. Suppose
also that the time in minutes between the arrivals of buses at the bus stop is
a continuous random variable with a $U(4, 6)$ distribution. Let $X$ be the time
that Jensen will wait.


a. What is the probability that $X$ is less than $4\tfrac{1}{2}$ (minutes)?

b. What is the probability that $X$ equals 5 (minutes)?

c. Is $X$ a discrete random variable or a continuous random variable?

5.5  □  The  probability  density  function  /  of  a  continuous  random  variable  X 
is  given  by: 


a.  Compute  c. 

b.  Compute  the  distribution  function  of  X . 

5.6 Let $X$ have an $Exp(0.2)$ distribution. Compute $P(X > 5)$.

5.7 The score of a student on a certain exam is represented by a number
between 0 and 1. Suppose that the student passes the exam if this number
is at least 0.55. Suppose we model this experiment by a continuous random
variable $S$, the score, whose probability density function is given by

\[ f(x) = \begin{cases} 4x & \text{for } 0 \leq x \leq \tfrac{1}{2} \\ 4 - 4x & \text{for } \tfrac{1}{2} \leq x \leq 1 \\ 0 & \text{elsewhere.} \end{cases} \]

a. What is the probability that the student fails the exam?

b. What is the score that he will obtain with a 50% chance, in other words,
what is the 50th percentile of the score distribution?

5.8 ⊞ Consider Quick exercise 5.2. For another dart thrower it is given that his
distance to the center of the disc $Y$ is described by the following distribution
function:


for $0 < b < r$


and $G(b) = 0$ for $b \leq 0$, $G(b) = 1$ for $b \geq r$.

a. Sketch the probability density function $g(y) = \frac{d}{dy}G(y)$.

b. Is this person "better" than the person in Quick exercise 5.2?

c. Sketch a distribution function associated to a person who in 90% of his
throws hits the disc no further than $0.1 \cdot r$ of the center.

5.9 □ Suppose we choose arbitrarily a point from the square with corners at
(2,1), (3,1), (2,2), and (3,2). The random variable $A$ is the area of the triangle
with its corners at (2,1), (3,1) and the chosen point (see Figure 5.9).

Fig. 5.9. A triangle in a square.

a. What is the largest area $A$ that can occur, and what is the set of points
for which $A \leq 1/4$?

b. Determine the distribution function $F$ of $A$.

c. Determine the probability density function $f$ of $A$.

5.10 Consider again the chemical reactor example with parameter $\lambda = 0.5$.
We saw in Section 5.6 that 10% of the particles stay in the vessel no longer
than about 12 seconds, while the mean residence time is 2 minutes. Which
percentage of the particles stay no longer than 2 minutes in the vessel?

5.11 Compute the median of an $Exp(\lambda)$ distribution.

5.12 □ Compute the median of a $Par(1)$ distribution.

5.13 ⊞ We consider a random variable $Z$ with a standard normal distribution.

a. Show why the symmetry of the probability density function $\phi$ of $Z$ implies
that for any $a$ one has $\Phi(-a) = 1 - \Phi(a)$.

b. Use this to compute $P(Z \leq -2)$.

5.14 Determine the 10th percentile of a standard normal distribution.


Sometimes probabilistic models are so complex that the tools of mathematical
analysis are not sufficient to answer all relevant questions about them.
Stochastic simulation is an alternative approach: values are generated for the
random variables and inserted into the model, thus mimicking outcomes for
the whole system. It is shown in this chapter how one can use uniform random
number generators to mimic random variables. Also two larger simulation
examples are presented.


6.1  What  is  simulation? 

In  many  areas  of  science,  technology,  government,  and  business,  models  are 
used  to  gain  understanding  of  some  part  of  reality  (the  portion  of  interest  is 
often  referred  to  as  “the  system”).  Sometimes  these  are  physical  models,  such 
as  a  scale  model  of  an  airplane  in  a  wind  tunnel  or  a  scale  model  of  a  chemical 
plant.  Other  models  are  abstract,  such  as  macroeconomic  models  consisting 
of  equations  relating  things  like  interest  rates,  unemployment,  and  inflation 
or  partial  differential  equations  describing  global  weather  patterns. 

In  simulation,  one  uses  a  model  to  create  specific  situations  in  order  to  study 
the  response  of  the  model  to  them  and  then  interprets  this  in  terms  of  what 
would  happen  to  the  system  “in  the  real  world.”  In  this  way,  one  can  carry 
out  experiments  that  are  impossible,  too  dangerous,  or  too  expensive  to  do 
in  the  real  world — addressing  questions  like:  What  happens  to  the  average 
temperature  if  we  reduce  the  greenhouse  gas  emissions  globally  by  50%?  Can 
the  plane  still  fly  if  engines  3  and  4  stop  in  midair?  What  happens  to  the 
distribution  of  wealth  if  we  halve  the  tax  rate? 

More  specifically,  we  focus  on  situations  and  problems  where  randomness  or 
uncertainty  or  both  play  a  significant  or  dominant  role  and  should  be  modeled 
explicitly.  Models  for  such  systems  involve  random  variables,  and  we  speak  of 
probabilistic or stochastic models. Simulating them is stochastic simulation. In
the preceding chapters we have encountered some of the tools of probability
theory,  and  we  will  encounter  others  in  the  chapters  to  come.  With  these  tools 
we  can  compute  quantities  of  interest  explicitly  for  many  models.  Stochastic 
simulation  of  a  system  means  generating  values  for  all  the  random  variables 
in  the  model,  according  to  their  specified  distributions,  and  recording  and 
analyzing  what  happens.  We  refer  to  the  generated  values  as  realizations  of 
the  random  variables. 

For  us,  there  are  two  reasons  to  learn  about  stochastic  simulation.  The  first  is 
that  for  complex  systems,  simulation  can  be  an  alternative  to  mathematical 
analysis, sometimes the only one. The second reason is that through simulation,
we can get more feeling for random variables, and this is why we study
stochastic  simulation  at  this  point  in  the  book.  We  start  by  asking  how  we 
can  generate  a  realization  of  a  random  variable. 


6.2  Generating  realizations  of  random  variables 

Simulations are almost always done using computers, which usually have one
or more so-called (pseudo) random number generators. A call to the random
number generator returns a random number between 0 and 1, which mimics
a realization of a $U(0, 1)$ variable. With this source of uniform (pseudo)
randomness we can construct any random variable we want by transforming the
outcome, as we shall see.

Quick  exercise  6.1  Describe  how  you  can  simulate  a  coin  toss  when  instead 
of  a  coin  you  have  a  die.  Any  ideas  on  how  to  simulate  a  roll  of  a  die  if  you 
only  have  a  coin? 

Bernoulli  random  variables 

Suppose $U$ has a $U(0, 1)$ distribution. To construct a $Ber(p)$ random variable
for some $0 < p < 1$, we define

\[ X = \begin{cases} 1 & \text{if } U < p, \\ 0 & \text{if } U \geq p \end{cases} \]

so that

\[ P(X = 1) = P(U < p) = p, \qquad P(X = 0) = P(U \geq p) = 1 - p. \]

This random variable $X$ has a Bernoulli distribution with parameter $p$.
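
A minimal Python sketch of this construction follows. It assumes that `random.Random.random()` from the standard library serves as the $U(0, 1)$ generator; the seed, the value $p = 0.3$, the number of draws, and the function name `bernoulli` are illustrative choices.

```python
import random

def bernoulli(p, rng):
    """Return 1 if U < p and 0 otherwise, where U mimics a U(0, 1) realization."""
    u = rng.random()
    return 1 if u < p else 0

rng = random.Random(42)   # fixed seed so the run is reproducible
p = 0.3                   # illustrative success probability
draws = [bernoulli(p, rng) for _ in range(100_000)]
frequency = sum(draws) / len(draws)
print(frequency)          # relative frequency of ones
```

The relative frequency of ones should be close to $p$, in line with $P(X = 1) = p$.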

Quick exercise 6.2 A random variable $Y$ has outcomes 1, 3, and 4 with the
following probabilities: $P(Y = 1) = 3/5$, $P(Y = 3) = 1/5$, and $P(Y = 4) = 1/5$.
Describe how to construct $Y$ from a $U(0, 1)$ random variable.


Suppose we have the distribution function $F$ of a continuous random variable
and we wish to construct a random variable with this distribution. We show
how to do this if $F$ is strictly increasing from 0 to 1 on an interval. In that
case $F$ has an inverse function $F^{\mathrm{inv}}$. Figure 6.1 shows an example: $F$ is strictly
increasing on the interval $[2, 10]$; the inverse $F^{\mathrm{inv}}$ is a function from the interval
$[0, 1]$ to the interval $[2, 10]$.


Fig. 6.1. Simulating a continuous random variable using the distribution function.


Note how $u$ relates to $F^{\mathrm{inv}}(u)$ as $F(x)$ relates to $x$. We see that $u \leq F(x)$
is equivalent with $F^{\mathrm{inv}}(u) \leq x$. If instead of a real number $u$ we consider a
$U(0, 1)$ random variable $U$, we obtain that the corresponding events are the
same:

\[ \{U \leq F(x)\} = \{F^{\mathrm{inv}}(U) \leq x\}. \tag{6.1} \]

We know about the $U(0, 1)$ random variable $U$ that $P(U \leq b) = b$ for any
number $0 \leq b \leq 1$. Substituting $b = F(x)$ we see

\[ P(U \leq F(x)) = F(x). \]

From equality (6.1), therefore,

\[ P(F^{\mathrm{inv}}(U) \leq x) = F(x); \]

in other words, the random variable $F^{\mathrm{inv}}(U)$ has distribution function $F$.
What remains is to find the function $F^{\mathrm{inv}}$. From Figure 6.1 we see

\[ F(x) = u \quad \Leftrightarrow \quad x = F^{\mathrm{inv}}(u), \]

so if we solve the equation $F(x) = u$ for $x$, we obtain the expression for
$F^{\mathrm{inv}}(u)$.

Exponential random variables

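We apply this method to the exponential distribution. On the interval $[0, \infty)$,
the $Exp(\lambda)$ distribution function is strictly increasing and given by

\[ F(x) = 1 - e^{-\lambda x}. \]

To find $F^{\mathrm{inv}}$, we solve the equation $F(x) = u$:

\[
\begin{aligned}
F(x) = u \quad &\Leftrightarrow \quad 1 - e^{-\lambda x} = u \\
&\Leftrightarrow \quad e^{-\lambda x} = 1 - u \\
&\Leftrightarrow \quad -\lambda x = \ln(1 - u) \\
&\Leftrightarrow \quad x = -\frac{1}{\lambda}\ln(1 - u),
\end{aligned}
\]

so $F^{\mathrm{inv}}(u) = -\frac{1}{\lambda}\ln(1 - u)$ and if $U$ has a $U(0, 1)$ distribution, then the
random variable $X$ defined by

\[ X = F^{\mathrm{inv}}(U) = -\frac{1}{\lambda}\ln(1 - U) \]

has an $Exp(\lambda)$ distribution.

In practice, one replaces $1 - U$ with $U$, because both have a $U(0, 1)$ distribution
(see Exercise 6.3). Leaving out the subtraction leads to more efficient computer
code. So instead of $X$ we may use

\[ Y = -\frac{1}{\lambda}\ln(U), \]

which also has an $Exp(\lambda)$ distribution.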
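
The method can be sketched in a few lines of Python. The example below assumes `random.Random.random()` as the uniform generator and keeps the form $-\frac{1}{\lambda}\ln(1 - U)$, because that generator may return exactly 0, for which $\ln(U)$ is undefined; this is a practical detail of the chosen generator and not a statement from the text. The seed, $\lambda = 0.25$, and the names are illustrative.

```python
import math
import random

def exp_from_uniform(lam, u):
    """Inverse of the Exp(lam) distribution function: -(1/lam) * ln(1 - u), 0 <= u < 1."""
    return -math.log(1.0 - u) / lam

def exp_cdf(lam, x):
    """Distribution function of Exp(lam) for x >= 0."""
    return 1.0 - math.exp(-lam * x)

rng = random.Random(7)    # fixed seed so the run is reproducible
lam = 0.25                # illustrative parameter
xs = [exp_from_uniform(lam, rng.random()) for _ in range(100_000)]

# Compare the empirical distribution function with F at a few points.
for x in (1.0, 4.0, 10.0):
    empirical = sum(1 for v in xs if v <= x) / len(xs)
    print(x, empirical, exp_cdf(lam, x))

fraction_at_4 = sum(1 for v in xs if v <= 4.0) / len(xs)
```

The empirical fractions should track $F(x) = 1 - e^{-\lambda x}$ up to simulation noise.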

Quick exercise 6.3 A distribution function $F$ is 0 for $x < 1$ and 1 for $x > 3$,
and $F(x) = \tfrac{1}{4}(x - 1)^2$ if $1 \leq x \leq 3$. Let $U$ be a $U(0, 1)$ random variable.
Construct a random variable with distribution $F$ from $U$.

Remark  6.1  (The  general  case).  The  restriction  we  imposed  earlier, 
that  the  distribution  function  should  be  strictly  increasing,  is  not  really 
necessary.  Furthermore,  a  distribution  function  with  jumps  or  a  flat  section 
somewhere  in  the  middle  is  not  a  problem  either.  We  illustrate  this  with  an 
example  in  Figure  6.2. 

This $F$ has a jump at 4 and so for a corresponding $X$ we should have
$P(X = 4) = 0.2$, the size of the jump. We see that whenever $U$ is in the
interval [0.3, 0.5], it is mapped to 4 by our method, and that this happens
with exactly the right probability!

The flat section of $F$ between 7 and 8 seems to pose a problem: the equation
$F(a) = 0.85$ has as its solution any $a$ between 7 and 8, and we cannot
define a unique inverse. This, however, does not really matter, because
$P(U = 0.85) = 0$, and we can define the inverse $F^{\mathrm{inv}}(0.85)$ in any way we
want. Taking the left endpoint, here the number 7, agrees best with the
definition of quantiles (see page 66).
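
The left-endpoint convention can be sketched in code. The example below is a hedged illustration in Python: the distribution (values 1, 4, 7 with cumulative probabilities 0.3, 0.5, 1.0) is hypothetical and is not the $F$ of Figure 6.2, and the helper name `generalized_inverse` is an assumption. It returns the smallest value $x$ with $F(x) \ge u$, which handles jumps and flat sections alike.

```python
import bisect
import random

def generalized_inverse(values, cum_probs, u):
    """Return the smallest value x with F(x) >= u (left-endpoint convention)."""
    idx = bisect.bisect_left(cum_probs, u)      # first index with cum_probs[idx] >= u
    return values[min(idx, len(values) - 1)]

# Hypothetical discrete distribution, for illustration only:
# F jumps by 0.3 at 1, by 0.2 at 4, and by 0.5 at 7, and is flat in between.
values = [1, 4, 7]
cum_probs = [0.3, 0.5, 1.0]

rng = random.Random(5)                           # arbitrary seed
draws = [generalized_inverse(values, cum_probs, rng.random()) for _ in range(100_000)]
freq_4 = draws.count(4) / len(draws)             # should be near the jump size 0.2
```

Every $u$ in the interval (0.3, 0.5] is mapped to 4, so the value 4 occurs with probability 0.2, the size of the jump, in line with the argument of the remark.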


Remark 6.2 (Existence of random variables). The previous remark
supplies a sketchy argument for the fact that any nondecreasing, right-continuous
function $F$, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$, is the
distribution function of some random variable.

Generating  sequences 

For simulations we often want to generate realizations for a large number of
random variables. Random number generators have been designed with this
purpose in mind: each new call mimics a new $U(0,1)$ random variable. The
sequence of numbers thus generated is considered as a realization of a sequence
of $U(0,1)$ random variables $U_1, U_2, U_3, \ldots$ with the special property that the
events $\{U_i \le a_i\}$ are independent¹ for every choice of the $a_i$.
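
As a small illustration of repeated calls to a generator, here is a sketch in Python, assuming the standard library `random.Random` class plays the role of the random number generator; the seed value is arbitrary. Pseudo random generators are deterministic given their seed, which makes a simulation experiment repeatable.

```python
import random

rng = random.Random(12345)                 # arbitrary seed, for reproducibility
us = [rng.random() for _ in range(5)]      # realizations of U_1, ..., U_5

# Starting again from the same seed reproduces exactly the same sequence.
rng_again = random.Random(12345)
us_again = [rng_again.random() for _ in range(5)]
```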


6.3  Comparing  two  jury  rules 

At  the  Olympic  Games  there  are  several  sports  events  that  are  judged  by  a 
jury,  including  gymnastics,  figure  skating,  and  ice  dancing.  During  the  2002 
winter  games  a  dispute  arose  concerning  the  gold  medal  in  ice  dancing:  there 
were  allegations  that  the  Russian  team  had  bribed  a  French  jury  member, 
thereby  causing  the  Russian  pair  to  win  just  ahead  of  the  Canadians.  We  look 
into  operating  rules  for  juries,  although  we  leave  the  effects  of  bribery  to  the 
exercises  (Exercise  6.11). 

Suppose we have a jury of seven members, and for each performance each
juror assigns a grade. The seven grades are to be transformed into a final
score. Two rules to do this are under consideration, and we want to choose
the better one. For the first one, the highest and lowest scores are removed
and the final score is the average of the remaining five. For the second rule,
the scores are put in ascending order and the middle one is assigned as final
score. Before you continue reading, consider which rule is better and how you
can verify this.

¹ In Chapter 9 we return to the question of independence between random variables.


A  probabilistic  model 

For  our  investigation  we  assume  that  the  scores  the  jurors  assign  deviate  by 
some  random  amount  from  the  true  or  deserved  score.  We  model  the  score 
that  juror  i  assigns  when  the  performance  deserves  a  score  g  by 

$$Y_i = g + Z_i \quad \text{for } i = 1, \ldots, 7, \qquad (6.2)$$

where $Z_1, \ldots, Z_7$ are random variables with values around zero. Let $h_1$ and
$h_2$ be functions implementing the two rules:

$$h_1(y_1, \ldots, y_7) = \text{average of the middle five of } y_1, \ldots, y_7,$$

$$h_2(y_1, \ldots, y_7) = \text{middle value of } y_1, \ldots, y_7.$$


We  are  interested  in  deviations  from  the  deserved  score  g: 

$$T = h_1(Y_1, \ldots, Y_7) - g,$$

$$M = h_2(Y_1, \ldots, Y_7) - g. \qquad (6.3)$$

The distributions of $T$ and $M$ depend on the individual jury grades, and
through those, on the juror-deviations $Z_1, Z_2, \ldots, Z_7$, which we model as
$U(-0.5, 0.5)$ variables. This more or less finishes the modeling phase: we have
given  a  stochastic  model  that  mimics  the  workings  of  a  jury  and  have  defined, 
in  terms  of  the  variables  in  the  model,  the  random  variables  T  and  M  that 
represent  the  errors  that  result  after  application  of  the  jury  rules. 

In  any  serious  application,  the  model  should  be  validated.  This  means  that 
one  tries  to  gather  evidence  to  convince  oneself  and  others  that  the  model 
adequately  reflects  the  workings  of  the  real  system.  In  this  chapter  we  are 
more  interested  in  showing  what  you  can  do  with  simulation  once  you  have  a 
model,  so  we  skip  the  validation. 

The  next  phase  is  analysis:  which  of  the  deviations  is  closer  to  zero?  Because 
T  and  M  are  random  variables,  we  would  have  to  clarify  what  we  mean  by 
that,  and  answering  the  question  certainly  involves  computing  probabilities 
about  T  and  M.  We  cannot  do  this  with  what  we  have  learned  so  far,  but  we 
know  how  to  simulate,  so  this  is  what  we  do. 


Simulation 

To generate a realization of a $U(-0.5, 0.5)$ random variable, we only need to
subtract 0.5 from the result we obtain from a call to the random generator.
We do this 7 times and insert the resulting values in (6.2) as jury deviations
$Z_1, \ldots, Z_7$, and substitute them in equations (6.3) to obtain $T$ and $M$ (the
value of $g$ is irrelevant: it drops out of the calculation):

$$T = \text{average of the middle five of } Z_1, \ldots, Z_7,$$

$$M = \text{middle value of } Z_1, \ldots, Z_7. \qquad (6.4)$$

In simulation terminology, this is called a run: we have gone through the whole
procedure  once,  inserting  realizations  for  the  random  variables.  If  we  repeat 
the  whole  procedure,  we  have  a  second  run;  see  Table  6.1  for  the  results  of 
five  runs. 


Table 6.1. Simulation results for the two jury rules.

Run     Z1      Z2      Z3      Z4      Z5      Z6      Z7       T       M
 1    -0.45   -0.08   -0.38    0.11   -0.42    0.48    0.02   -0.15   -0.08
 2    -0.37   -0.18    0.05   -0.10    0.01    0.28    0.31    0.01    0.01
 3     0.08    0.07    0.47   -0.21   -0.33   -0.22   -0.48   -0.12   -0.21
 4     0.24    0.08   -0.11    0.19   -0.03    0.02    0.44    0.10    0.08
 5     0.10    0.18   -0.39   -0.24   -0.36   -0.25    0.20   -0.11   -0.24

Quick exercise 6.4 The next realizations for $Z_1, \ldots, Z_7$ are: -0.05, 0.26,
0.25, 0.39, 0.22, 0.23, 0.13. Determine the corresponding realizations of $T$
and $M$.
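
The two jury rules are easy to express in code, which also gives a way to check table entries. The following is a minimal sketch in Python; the function names `trimmed_mean_rule` and `median_rule` are illustrative assumptions. It recomputes the realizations of $T$ and $M$ for run 1 of Table 6.1.

```python
def trimmed_mean_rule(scores):
    """Rule 1: drop the highest and the lowest score, average the remaining ones."""
    middle = sorted(scores)[1:-1]
    return sum(middle) / len(middle)

def median_rule(scores):
    """Rule 2: the middle value of an odd number of scores."""
    ordered = sorted(scores)
    return ordered[len(ordered) // 2]

# Juror deviations Z_1, ..., Z_7 from run 1 of Table 6.1.
run1 = [-0.45, -0.08, -0.38, 0.11, -0.42, 0.48, 0.02]
t = trimmed_mean_rule(run1)   # realization of T, -0.15 after rounding
m = median_rule(run1)         # realization of M, -0.08
```

Because $g$ drops out of the calculation, the functions can be applied to the deviations directly instead of to the scores $Y_i$.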

Table 6.1 can be used to check some computations. We also see that the
realization of $T$ was closest to zero in runs 3 and 5, the realization of $M$ was
closest to zero in runs 1 and 4, and they were (about) the same in run 2. There is no
clear conclusion from this, and even if there were, one could wonder whether
the next five runs would yield the same picture. Because the whole process
mimics randomness, one has to expect some variation, or perhaps a lot. In
later chapters we will get a better understanding of this variation; for the
moment we just say that judgment based on a large number of runs is better.
We do one thousand runs and exchange the table for pictures. Figure 6.3
depicts, for juror 1, a histogram of all the deviations from the true score $g$. For
each  interval  of  length  0.05  we  have  counted  the  number  of  runs  for  which  the 
deviation  of  juror  1  fell  in  that  interval.  These  numbers  vary  from  about  40 
to  about  60. 

This is just to get an idea about the results for an individual juror. In
Figure 6.4 we see histograms for the final scores. Comparing the histograms, it
seems that the realizations of $T$ are more concentrated near zero than those
of $M$.

Fig. 6.3. Deviations of juror 1 from the deserved score, one thousand runs.

Fig. 6.4. One thousand realizations of $T$ (left) and $M$ (right).


However, the two histograms do not tell us anything about the relation
between $T$ and $M$, so we plot the realizations of pairs $(T, M)$ for all one thousand
runs (Figure 6.5). From this plot we see that in most cases $M$ and $T$ go in
the same direction: if $T$ is positive, then usually $M$ is also positive, and the
same goes for negative values. In terms of the final scores, both rules generally
overvalue and undervalue the performance simultaneously. On closer
examination, with help of the line drawn from (-0.5, -0.5) to (0.5, 0.5), we see that
the $T$ values tend to be a little closer to zero than the $M$ values.

Fig. 6.5. Plot of the points $(T, M)$, one thousand runs.

This suggests that we make a histogram that shows the difference of the
absolute deviations from true score. For rule 1 this absolute deviation is $|T|$,
for rule 2 it is $|M|$. If the difference $|M| - |T|$ is positive, then $T$ is closer to
zero than $M$, and the difference tells us by how much. A negative difference
means that $M$ was closer. In Figure 6.6 all the differences are shown in a
histogram. The bars to the right of zero represent 696 runs. So, in about 70%
of the runs, rule 1 resulted in a final score that is closer to the true score than
rule 2. In about 30% of the cases, rule 2 was better, but generally by a smaller
amount, as we see from the histogram.
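
A simulation of this kind takes only a few lines. The sketch below, in Python, is one possible way to repeat the experiment under the model of this section ($U(-0.5, 0.5)$ deviations, seven jurors); the function name `simulate_jury`, the seed, and the number of runs are assumptions for illustration, and the resulting fraction will differ somewhat from the 696 out of 1000 reported above because other random numbers are used.

```python
import random

def simulate_jury(n_runs, rng):
    """Return the fraction of runs in which |M| - |T| > 0, i.e. rule 1 is closer."""
    rule1_closer = 0
    for _ in range(n_runs):
        # Seven U(-0.5, 0.5) juror deviations, sorted in ascending order.
        z = sorted(rng.random() - 0.5 for _ in range(7))
        t = sum(z[1:6]) / 5          # average of the middle five
        m = z[3]                     # middle value
        if abs(m) - abs(t) > 0:
            rule1_closer += 1
    return rule1_closer / n_runs

fraction = simulate_jury(10_000, random.Random(1))   # arbitrary seed
```

With more runs the fraction fluctuates less from one experiment to the next, which is the reason for preferring a large number of runs.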


Fig. 6.6. Differences $|M| - |T|$ for one thousand runs.


6.4 The single-server queue

There  are  many  situations  in  life  where  you  stand  in  a  line  waiting  for  some 
service:  when  you  want  to  withdraw  money  from  a  cash  dispenser,  borrow 
books  at  the  library,  be  admitted  to  the  emergency  room  at  the  hospital,  or 
pump  gas  at  the  gas  station.  Many  other  queueing  situations  are  hidden:  an 
email  message  you  send  might  be  queued  at  the  local  server  until  it  has  sent 
all  messages  that  were  submitted  ahead  of  yours;  searching  the  Internet,  your 
browser  sends  and  receives  packets  of  information  that  are  queued  at  various 
stages  and  locations;  in  assembly  lines,  partly  finished  products  move  from 
station  to  station,  each  time  waiting  for  the  next  component  to  be  added. 

We  are  going  to  study  one  simple  queueing  model,  the  so-called  single-server 
queue:  it  has  one  server  or  service  mechanism,  and  the  arriving  customers 
await  their  turn  in  order  of  their  arrival.  For  definiteness,  think  of  an  oasis 
with  one  big  water  well.  People  arrive  at  the  well  with  bottles,  jerry  cans,  and 
other  types  of  containers,  to  pump  water.  The  supply  of  water  is  large,  but 
the  pump  capacity  is  limited.  The  pump  is  about  to  be  replaced,  and  while  it 
is  clear  that  a  larger  pump  capacity  will  result  in  shorter  waiting  times,  more 
powerful  pumps  are  also  more  expensive.  Therefore,  to  prepare  a  decision  that 
balances  costs  and  benefits,  we  wish  to  investigate  the  relationship  between 
pump  capacity  and  system  performance. 

Modeling  the  system 

A  stochastic  model  is  in  order:  some  general  characteristics  are  known,  such 
as  how  many  people  arrive  per  day  and  how  much  water  they  take  on  average, 
but  the  individual  arrival  times  and  amounts  are  unpredictable.  We  introduce 
random variables to describe them: let $T_1$ be the time between the start at
time zero and the arrival of the first customer, $T_2$ the time between the arrivals
of the first and the second customer, $T_3$ the time between the second and the
third, etc.; these are called the interarrival times. Let $S_i$ be the length of time
that customer $i$ needs to use the pump; in standard terminology this is called
the service time. This is our description so far:

Arrivals at:     $T_1$     $T_1 + T_2$     $T_1 + T_2 + T_3$     etc.

Service times:   $S_1$     $S_2$           $S_3$                 etc.

The pump capacity $v$ (liters per minute) is not a random variable but a model
parameter or decision variable, whose “best” value we wish to determine. So
if customer $i$ requires $R_i$ liters of water, then her service time is

$$S_i = \frac{R_i}{v}.$$

To complete the model description, we need to specify the distribution of the
random variables $T_i$ and $R_i$.

Interarrival times: every $T_i$ has an $Exp(0.5)$ distribution (minutes);

Service requirement: every $R_i$ has a $U(2, 5)$ distribution (liters).

This particular choice of distributions would have to be supported by evidence
that they are suited for the system at hand: a validation step as suggested for
the jury model is appropriate here as well. For many arrival-type processes,
however, the exponential distribution is reasonable as a model for the
interarrival times (see Chapter 12). The particular uniform distribution chosen for
the required amount of water says that all amounts between 2 and 5 liters are
equally likely. So there is no sheik who owns a 5000-liter water truck in “our”
oasis.

To evaluate system performance, we want to extract from the model the
waiting times of the customers and how busy it is at the pump.


Waiting  times 

Let $W_i$ denote the waiting time of customer $i$. The first customer is lucky;
the system starts empty, and so $W_1 = 0$. For customer $i$ the waiting time
depends on how long customer $i - 1$ spent in the system compared to the time
between their respective arrivals. We see that if the interarrival time $T_i$ is long,
relatively speaking, then customer $i$ arrives after the departure of customer
$i - 1$, and so $W_i = 0$:

[Diagram: time line showing the arrival of customer $i - 1$, the interval $W_{i-1} + S_{i-1}$
up to the departure of customer $i - 1$, and the later arrival of customer $i$ after
interarrival time $T_i$; here $W_i = 0$.]

On the other hand, if customer $i$ arrives before the departure, the waiting
time $W_i$ equals whatever remains of $W_{i-1} + S_{i-1}$:

[Diagram: time line showing the arrival of customer $i - 1$, the arrival of customer $i$
after interarrival time $T_i$, and the later departure of customer $i - 1$ at the end of
$W_{i-1} + S_{i-1}$; here $W_i = W_{i-1} + S_{i-1} - T_i$.]

Summarizing the two cases, we obtain:

$$W_i = \max\{W_{i-1} + S_{i-1} - T_i,\ 0\}. \qquad (6.5)$$

To carry out a simulation, we start at time zero and generate realizations of
the interarrival times (the $T_i$) and service requirements (the $R_i$) for as long
as we want, computing the other quantities that follow from the model on the
way. Table 6.2 shows the values generated this way, for two pump capacities
($v = 2$ and 3) for the first six customers. Note that in both cases we use the
same realizations of $T_i$ and $R_i$.

Table 6.2. Results of a short simulation.

      Input realizations            v = 2            v = 3
 i    T_i    Arr.time    R_i      S_i     W_i      S_i     W_i
 1    0.24     0.24      4.39     2.20    0        1.46    0
 2    1.97     2.21      4.00     2.00    0.23     1.33    0
 3    1.73     3.94      2.33     1.17    0.50     0.78    0
 4    2.82     6.76      4.03     2.01    0        1.34    0
 5    1.01     7.77      4.17     2.09    1.00     1.39    0.33
 6    1.09     8.86      4.24     2.12    1.99     1.41    0.63

Quick exercise 6.5 The next four realizations are $T_7$: 1.86; $R_7$: 4.79; $T_8$:
1.08; and $R_8$: 2.33. Complete the corresponding rows of the table.
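
Recursion (6.5) translates directly into a loop. The following Python sketch, with the illustrative function name `waiting_times`, recomputes the waiting times of the first six customers from the input realizations of Table 6.2. Since the table lists inputs rounded to two decimals, the results can differ from the tabulated waiting times in the last digit.

```python
def waiting_times(interarrival, requirement, v):
    """Waiting times W_1, ..., W_n from recursion (6.5), with S_i = R_i / v."""
    waits = [0.0]                                  # the first customer never waits
    for i in range(1, len(interarrival)):
        prev_service = requirement[i - 1] / v      # S_{i-1}
        waits.append(max(waits[-1] + prev_service - interarrival[i], 0.0))
    return waits

# Input realizations from Table 6.2 (rounded to two decimals there).
t = [0.24, 1.97, 1.73, 2.82, 1.01, 1.09]           # interarrival times T_i (minutes)
r = [4.39, 4.00, 2.33, 4.03, 4.17, 4.24]           # service requirements R_i (liters)

w_v2 = waiting_times(t, r, 2)                      # pump capacity v = 2
w_v3 = waiting_times(t, r, 3)                      # pump capacity v = 3
```

Both calls use the same lists `t` and `r`, mirroring the remark that the two pump capacities are compared on the same realizations.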

Longer simulations produce so many numbers that we will drown in them
unless we think of something. First, we summarize the waiting times of the
first $n$ customers with their average:

$$\bar{W}_n = \frac{W_1 + W_2 + \cdots + W_n}{n}. \qquad (6.6)$$

Then, instead of giving a table, we plot the pairs $(n, \bar{W}_n)$, for $n = 1, 2, \ldots$ until
the end of the simulation. In Figure 6.7 we see that both lines bounce up and
down a bit. Toward the end, the average waiting time for pump capacity 3 is
about 0.5 and for $v = 2$ about 2. In a longer simulation we would see each of
the averages converge to a limiting value (a consequence of the so-called law
of large numbers, the topic of Chapter 13).
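
A longer run of the queue can be sketched as follows. This Python example assumes the model of this section ($Exp(0.5)$ interarrival times, $U(2,5)$ requirements) and uses the inverse-transform construction of Section 6.2 for the exponential variables; the function name `running_average_waits`, the seed, and the run length of 5000 customers are illustrative choices, so the numbers it produces are not those behind Figure 6.7.

```python
import math
import random

def running_average_waits(n_customers, v, seed):
    """Simulate the single-server queue; return the averages (6.6) for n = 1, ..., n_customers."""
    rng = random.Random(seed)
    averages = []
    wait = 0.0                    # W_1 = 0: the system starts empty
    total = 0.0
    prev_service = None
    for n in range(1, n_customers + 1):
        t = -math.log(1.0 - rng.random()) / 0.5    # T_n with an Exp(0.5) distribution
        r = 2.0 + 3.0 * rng.random()               # R_n with a U(2, 5) distribution
        if prev_service is not None:
            wait = max(wait + prev_service - t, 0.0)   # recursion (6.5)
        total += wait
        averages.append(total / n)
        prev_service = r / v                       # S_n = R_n / v
    return averages

# The same seed gives the same realizations of T_i and R_i for both capacities.
avg_v2 = running_average_waits(5000, v=2, seed=7)
avg_v3 = running_average_waits(5000, v=3, seed=7)
```

Plotting `avg_v2` and `avg_v3` against $n$ gives a picture of the same kind as the one described for Figure 6.7.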


Fig. 6.7. Averaged waiting times at the well, for pump capacity 2 and 3.

Work-in-system

To show how busy it is at the pump one could record how many customers are
waiting in the queue and plot this quantity against time. A slightly different
approach is to record at every moment how much work there is in the system,
that is, how much time it would take to serve everyone present at that moment.
For example, if I am halfway through filling my 4-liter jerry can and three
persons are waiting who require 2, 3, and 5 liters, respectively, then there are
12 liters to go; at $v = 2$, there is 6 minutes of work in the system, and at
$v = 3$ just 4.

The  amount  of  work  in  the  system  just  before  a  customer  arrives  equals  the 
waiting  time  of  that  customer,  because  it  is  exactly  the  time  it  takes  to  finish 
the  work  for  everybody  ahead  of  her.  The  work-in-system  at  time  t  tells  us 
how  long  the  wait  would  be  if  somebody  were  to  arrive  at  t.  For  this  reason, 
this  quantity  is  also  called  the  virtual  waiting  time. 

Figure 6.8 shows the work-in-system as a function of time for the first 15
minutes, using the same realizations that were the basis for Table 6.2. In the
top graph, corresponding to $v = 2$, the work in the system jumps to 2.20
(which is the realization of $R_1/2$) at $t = 0.24$, when the first customer arrives.
So at $t = 2.21$, which is 1.97 later, there is 2.20 - 1.97 = 0.23 minute of work
left; this is the waiting time for customer 2, who brings an amount of work
of 2.00 minutes, so the peak at $t = 2.21$ is 0.23 + 2.00 = 2.23, etc. In the bottom
graph we see the work-in-system reach zero more often, because the individual
(work) amounts are 2/3 of what they are when $v = 2$. More often, arriving
customers find the queue empty and the pump not in use; they do not have
to wait.

Fig. 6.8. Work in system: top, $v = 2$; bottom, $v = 3$.

In Figure 6.9 the work-in-system is depicted as a function of time for the
first 100 minutes of our run. At pump capacity 2 the virtual waiting time
peaks at close to 11 minutes after about 55 minutes, whereas with $v = 3$ the
corresponding peak is only about 4 minutes. There also is a marked difference
in the proportion of time the system is empty.

Fig. 6.9. Work in system: top, $v = 2$; bottom, $v = 3$.
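
The work-in-system at a given time can be computed from the arrival and service times alone: work drains at rate one between arrivals and jumps by the service time at each arrival. The Python sketch below uses the hypothetical helper name `work_in_system` and the first three customers of Table 6.2 at $v = 2$; it assumes the system is empty at time zero.

```python
def work_in_system(t, arrivals, services):
    """Work (in minutes) present at time t, given arrival times and service times."""
    work = 0.0      # work just after the most recent arrival handled so far
    last = 0.0      # time of that arrival (the system starts empty at time 0)
    for a, s in zip(arrivals, services):
        if a > t:
            break
        # Drain the work since the previous arrival, then add the new customer's work.
        work = max(work - (a - last), 0.0) + s
        last = a
    return max(work - (t - last), 0.0)

arrivals = [0.24, 2.21, 3.94]     # first three arrival times of Table 6.2
services = [2.20, 2.00, 1.17]     # corresponding service times for v = 2

peak_at_second_arrival = work_in_system(2.21, arrivals, services)   # 0.23 + 2.00
```

Evaluating the function on a fine grid of time points would give a curve of the sawtooth shape described for Figure 6.8.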


6.5  Solutions  to  the  quick  exercises 

6.1  To  simulate  the  coin,  choose  any  three  of  the  six  possible  outcomes  of 
the  die,  report  heads  if  one  of  these  three  outcomes  turns  up,  and  report  tails 
otherwise.  For  example,  heads  if  the  outcome  is  odd,  tails  if  it  is  even. 

To  simulate  the  die  using  a  coin  is  more  difficult;  one  solution  is  as  follows. 
Toss  the  coin  three  times  and  use  the  following  conversion  table  to  map  the 
result: 


Coins   HHH   HHT   HTH   HTT   THH   THT
Die      1     2     3     4     5     6

Repeat  the  coin  tosses  if  you  get  TTH  or  TTT. 
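
The conversion table with repeated tosses can be sketched in Python as follows; the coin is simulated with `random.Random.choice`, and the function name `roll_die_with_coin`, the seed, and the number of rolls are assumptions made for this illustration.

```python
import random

TABLE = {"HHH": 1, "HHT": 2, "HTH": 3, "HTT": 4, "THH": 5, "THT": 6}

def roll_die_with_coin(rng):
    """Toss a fair coin three times; repeat until the result appears in the table."""
    while True:
        tosses = "".join(rng.choice("HT") for _ in range(3))
        if tosses in TABLE:              # TTH and TTT are rejected and tossed again
            return TABLE[tosses]

rng = random.Random(3)                   # arbitrary seed
rolls = [roll_die_with_coin(rng) for _ in range(60_000)]
counts = {face: rolls.count(face) for face in range(1, 7)}
```

Each of the six accepted outcomes is equally likely, so the six counts should all be close to 10 000.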
6.2 Let the $U(0,1)$ variable be $U$ and set:

[1 

Y=h 

So,  for  example,  P(y  =  3)  =  P(|  <  Cf  <  |)  =  |. 

6.3 The given distribution function $F$ is strictly increasing between 1 and 3,
so we use the method with $F^{\mathrm{inv}}$. Solve the equation $F(x) = \frac{1}{4}(x - 1)^2 = u$
for $x$. This yields $x = 1 + 2\sqrt{u}$, so we can set $X = 1 + 2\sqrt{U}$. If you need to
be convinced, determine $F_X$.
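
An empirical check, which is not a substitute for determining the distribution function, can be sketched in Python: simulate many realizations of $X = 1 + 2\sqrt{U}$ and compare the fraction that is at most $a$ with $F(a)$. The seed, the sample size, and the test point $a = 2$ are arbitrary choices for illustration.

```python
import math
import random

def F(x):
    """Distribution function of Quick exercise 6.3."""
    if x < 1:
        return 0.0
    if x > 3:
        return 1.0
    return (x - 1) ** 2 / 4

rng = random.Random(11)                  # arbitrary seed
xs = [1 + 2 * math.sqrt(rng.random()) for _ in range(100_000)]

# The fraction of realizations that are at most a should be close to F(a).
a = 2.0
empirical = sum(x <= a for x in xs) / len(xs)    # compare with F(2) = 0.25
```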

6.4 In ascending order the values are -0.05, 0.13, 0.22, 0.23, 0.25, 0.26, 0.39,
so for $M$ we find 0.23, and for $T$ we find (0.13 + 0.22 + 0.23 + 0.25 + 0.26)/5 = 0.22.

6.5 We find:

      Input realizations            v = 2            v = 3
 i    T_i    Arr.time    R_i      S_i     W_i      S_i     W_i
 7    1.86    10.72      4.79     2.39    2.25     1.60    0.18
 8    1.08    11.80      2.33     1.16    3.57     0.78    0.70

6.6  Exercises 

6.1 Let $U$ have a $U(0,1)$ distribution.

a. Describe how to simulate the outcome of a roll with a die using $U$.

b. Define $Y$ as follows: round $6U + 1$ down to the nearest integer. What are
the possible outcomes of $Y$ and their probabilities?

6.2 □ We simulate the random variable $X = 1 + 2\sqrt{U}$ constructed in Quick
exercise 6.3. As realization for $U$ we obtain from the pseudo random generator
the number $u = 0.3782739$.

a. What is the corresponding realization $x$ of the random variable $X$?

b. If the next call to the random generator yields $u = 0.3$, will the
corresponding realization for $X$ be larger or smaller than the value you found
in a?

c. What is the probability the next draw will be smaller than the value you
found in a?

6.3 Let $U$ have a $U(0,1)$ distribution. Show that $Z = 1 - U$ has a $U(0,1)$
distribution by deriving the probability density function or the distribution
function.

6.4 Let $F$ be the distribution function as given in Quick exercise 6.3: $F(x)$
is 0 for $x < 1$ and 1 for $x > 3$, and $F(x) = \frac{1}{4}(x - 1)^2$ if $1 \le x \le 3$. In the
answer it is claimed that $X = 1 + 2\sqrt{U}$ has distribution function $F$, where $U$
is a $U(0,1)$ random variable. Verify this by computing $P(X \le a)$ and checking
that this equals $F(a)$, for any $a$.

6.5 ⊞ We have seen that if $U$ has a $U(0,1)$ distribution, then $X = -\ln U$ has
an $Exp(1)$ distribution. Check this by verifying that $P(X \le a) = 1 - e^{-a}$ for
$a \ge 0$.

6.6 □ Somebody messed up the random number generator in your computer:
instead of uniform random numbers it generates numbers with an $Exp(2)$
distribution. Describe how to construct a $U(0,1)$ random variable $U$ from an
$Exp(2)$ distributed $X$.

Hint: look at how you obtain an $Exp(2)$ random variable from a $U(0,1)$
random variable.

6.7 ⊞ In models for the lifetimes of mechanical components one sometimes
uses random variables with distribution functions from the so-called Weibull
family. Here is an example: $F(x) = 0$ for $x < 0$, and

F(x) = 1 -  for x > 0.

Construct a random variable $Z$ with this distribution from a $U(0,1)$ variable.

6.8 A random variable $X$ has a $Par(3)$ distribution, so with distribution
function $F$ with $F(x) = 0$ for $x < 1$, and $F(x) = 1 - x^{-3}$ for $x \ge 1$. For details on
the Pareto distribution see Section 5.4. Describe how to construct $X$ from a
$U(0,1)$ random variable.

6.9  □  In  Quick  exercise  6.1  we  simulated  a  die  by  tossing  three  coins.  Recall 
that  we  might  need  several  attempts  before  succeeding. 

a.  What  is  the  probability  that  we  succeed  on  the  first  try? 

b.  Let  N  be  the  number  of  tries  that  we  need.  Determine  the  distribution 
of  N. 

6.10 ⊞ There is usually more than one way to simulate a particular random
variable. In this exercise we consider two ways to generate geometric random
variables.

a. We give you a sequence of independent $U(0,1)$ random variables $U_1, U_2,
\ldots$. From this sequence, construct a sequence of Bernoulli random
variables. From the sequence of Bernoulli random variables, construct a
(single) $Geo(p)$ random variable.

b. It is possible to generate a $Geo(p)$ random variable using just one $U(0,1)$
random variable. If calls to the random number generator take a lot of
CPU time, this would lead to faster simulation programs. Set $\lambda = -\ln(1 - p)$
and let $Y$ have an $Exp(\lambda)$ distribution. We obtain $Z$ from $Y$ by rounding
to the nearest integer greater than $Y$. Note that $Z$ is a discrete random
variable, whereas $Y$ is a continuous one. Show that, nevertheless, the event
$\{Z > n\}$ is the same as $\{Y > n\}$. Use this to compute $P(Z > n)$ from the
distribution of $Y$. What is the distribution of $Z$? (See Quick exercise 4.6.)

6.11  Reconsider  the  jury  example  (see  Section  6.3).  Suppose  the  first  jury 
member  is  bribed  to  vote  in  favor  of  the  present  candidate. 

a. How should you now model $Y_1$? Describe how you can investigate which
of the two rules is less sensitive to the effect of the bribery.

b. The International Skating Union decided to adopt a rule similar to the
following: randomly discard two of the jury scores, then average the
remaining scores. Describe how to investigate this rule. Do you expect this
rule to be more sensitive to the bribery than the two rules already
discussed, or less sensitive?

6.12 ⊞ A tiny financial model. To investigate investment strategies,
consider the following:

You  can  choose  to  invest  your  money  in  one  particular  stock  or  put  it  in  a 
savings  account.  Your  initial  capital  is  €  1000.  The  interest  rate  r  is  0.5%  per 
month  and  does  not  change.  The  initial  stock  price  is  €  100.  Your  stochastic 
model  for  the  stock  price  is  as  follows:  next  month  the  price  is  the  same  as 
this  month  with  probability  1/2,  with  probability  1/4  it  is  5%  lower,  and  with 
probability  1/4  it  is  5%  higher.  This  principle  applies  for  every  new  month. 
There  are  no  transaction  costs  when  you  buy  or  sell  stock. 

Your  investment  strategy  for  the  next  5  years  is:  convert  all  your  money  to 
stock  when  the  price  drops  below  €  95,  and  sell  all  stock  and  put  the  money 
in  the  bank  when  the  stock  price  exceeds  €  110. 

Describe  how  to  simulate  the  results  of  this  strategy  for  the  model  given. 

6.13 We give you an unfair coin and you do not know $P(H)$ for this coin. Can
you simulate a fair coin, and how many tosses do you need for each fair coin
toss?

Expectation and variance


Random  variables  are  complicated  objects,  containing  a  lot  of  information 
on  the  experiments  that  are  modeled  by  them.  If  we  want  to  summarize  a 
random  variable  by  a  single  number,  then  this  number  should  undoubtedly 
be  its  expected  value.  The  expected  value,  also  called  the  expectation  or  mean, 
gives  the  center — in  the  sense  of  average  value — of  the  distribution  of  the 
random  variable.  If  we  allow  a  second  number  to  describe  the  random  variable, 
then  we  look  at  its  variance,  which  is  a  measure  of  spread  of  the  distribution 
of  the  random  variable. 


7.1  Expected  values 

An  oil  company  needs  drill  bits  in  an  exploration  project.  Suppose  that  it  is 
known  that  (after  rounding  to  the  nearest  hour)  drill  bits  of  the  type  used 
in  this  particular  project  will  last  2,  3,  or  4  hours  with  probabilities  0.1,  0.7, 
and  0.2.  If  a  drill  bit  is  replaced  by  one  of  the  same  type  each  time  it  has  worn 
out,  how  long  could  exploration  be  continued  if  in  total  the  company  would 
reserve  10  drill  bits  for  the  exploration  job?  What  most  people  would  do  to 
answer  this  question  is  to  take  the  weighted  average 


$$0.1 \cdot 2 + 0.7 \cdot 3 + 0.2 \cdot 4 = 3.1,$$

and conclude that the exploration could continue for 10 × 3.1, or 31 hours.
This weighted average is what we call the expected value or expectation of the
random variable $X$ whose distribution is given by

$$P(X = 2) = 0.1, \quad P(X = 3) = 0.7, \quad P(X = 4) = 0.2.$$

It might happen that the company is unlucky and that each of the 10 drill bits
has worn out after two hours, in which case exploration ends after 20 hours.
At the other extreme, they may be lucky and drill for 40 hours on these 10
bits. However, it is a mathematical fact that the conclusion about a 31-hour
total  drilling  time  is  correct  in  the  following  sense:  for  a  large  number  n  of 
drill  bits  the  total  running  time  will  be  around  n  times  3.1  hours  with  high 
probability.  In  the  example,  where  n  =  10,  we  have,  for  instance,  that  drilling 
will  continue  for  29,  30,  31,  32,  or  33  hours  with  probability  more  than  0.86, 
while  the  probability  that  it  will  last  only  for  20,  21,  22,  23,  or  24  hours  is  less 
than  0.00006.  We  will  come  back  to  this  in  Chapters  13  and  14.  This  example 
illustrates  the  following  definition. 


Definition. The expectation of a discrete random variable $X$ taking
the values $a_1, a_2, \ldots$ and with probability mass function $p$ is the
number

$$E[X] = \sum_i a_i P(X = a_i) = \sum_i a_i p(a_i).$$

We also call $E[X]$ the expected value or mean of $X$. Since the expectation is
determined  by  the  probability  distribution  of  X  only,  we  also  speak  of  the 
expectation  or  mean  of  the  distribution. 
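
For the drill bit example the definition and the probabilities quoted above can be checked with a short computation. The Python sketch below computes the weighted average and then the exact distribution of the total lifetime of 10 bits by repeated convolution; it assumes that the lifetimes of different bits are independent, and the function name `pmf_of_sum` is an illustrative choice.

```python
from collections import defaultdict

pmf = {2: 0.1, 3: 0.7, 4: 0.2}          # lifetime distribution of one drill bit

# Expectation as a weighted average of the values.
expectation = sum(a * p for a, p in pmf.items())

def pmf_of_sum(pmf, n):
    """Exact distribution of the sum of n independent copies, by repeated convolution."""
    total = {0: 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for s, ps in total.items():
            for a, pa in pmf.items():
                nxt[s + a] += ps * pa
        total = dict(nxt)
    return total

total10 = pmf_of_sum(pmf, 10)
p_29_to_33 = sum(total10.get(h, 0.0) for h in range(29, 34))   # total between 29 and 33 hours
p_20_to_24 = sum(total10.get(h, 0.0) for h in range(20, 25))   # total between 20 and 24 hours
```

Under the independence assumption, `p_29_to_33` exceeds 0.86 and `p_20_to_24` stays below 0.00006, in agreement with the figures given in the example.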

Quick  exercise  7.1  Let  X  be  the  discrete  random  variable  that  takes  the 
values  1,  2,  4,  8,  and  16,  each  with  probability  1/5.  Compute  the  expectation 
of  X. 

Looking at an expectation as a weighted average gives a more physical
interpretation of this notion, namely as the center of gravity of weights $p(a_i)$
placed at the points $a_i$. For the random variable associated with the drill bit,
this is illustrated in Figure 7.1.


Fig.  7.1.  Expected  value  as  center  of  gravity. 

This point of view also leads the way to how one should define the expected
value of a continuous random variable. Let, for example, X be a continuous
random variable whose probability density function f is zero outside the
interval [0,1]. It seems reasonable to approximate X by the discrete random
variable Y, taking the values

$$\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1$$

with as probabilities the masses that X assigns to the intervals
$\left[\frac{k-1}{n}, \frac{k}{n}\right]$:

$$P\!\left(Y = \frac{k}{n}\right) = P\!\left(\frac{k-1}{n} \le X \le \frac{k}{n}\right) = \int_{(k-1)/n}^{k/n} f(x)\,dx.$$


We have a good idea of the size of this probability. For large n, it can be
approximated well in terms of f:

$$P\!\left(Y = \frac{k}{n}\right) = \int_{(k-1)/n}^{k/n} f(x)\,dx \approx \frac{1}{n}\, f\!\left(\frac{k}{n}\right).$$


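The “center-of-gravity” interpretation suggests that the expectation E[Y] of
Y should approximate the expectation E[X] of X. We have

$$E[Y] = \sum_{k=1}^{n} \frac{k}{n}\, P\!\left(Y = \frac{k}{n}\right) \approx \sum_{k=1}^{n} \frac{k}{n}\, f\!\left(\frac{k}{n}\right) \frac{1}{n}.$$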


By the definition of a definite integral, for large n the right-hand side is close
to

$$\int_0^1 x f(x)\,dx.$$

This  motivates  the  following  definition. 


Definition. The expectation of a continuous random variable X
with probability density function f is the number

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$$


We also call E[X] the expected value or mean of X. Note that E[X] is indeed
the center of gravity of the mass distribution described by the function f:

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \frac{\int_{-\infty}^{\infty} x f(x)\,dx}{\int_{-\infty}^{\infty} f(x)\,dx}.$$


This  is  illustrated  in  Figure  7.2. 

Fig. 7.2. Expected value as center of gravity, continuous case.
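
The discretization argument that motivated the definition can be tried out numerically. The sketch below assumes, purely for illustration, the density f(x) = 2x on [0,1] (not an example from the text), whose expectation is 2/3, and evaluates the approximating sum for E[Y] for increasing n. The names `discretized_expectation` and `density` are ours.

```python
def discretized_expectation(f, n):
    """Approximate E[Y] for Y on the grid k/n, using P(Y = k/n) ~ f(k/n) / n."""
    return sum((k / n) * f(k / n) * (1 / n) for k in range(1, n + 1))

def density(x):
    # Assumed example density on [0, 1]: f(x) = 2x, whose expectation is 2/3.
    return 2 * x if 0 <= x <= 1 else 0.0

for n in (10, 100, 1000):
    print(n, discretized_expectation(density, n))
```

The printed values decrease toward 2/3 as n grows, in line with the definite integral being the limit of these sums.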


Quick  exercise  7.2  Compute  the  expectation  of  a  random  variable  U  that 
is  uniformly  distributed  over  [2,5]. 


Remark 7.1 (The expected value may not exist!). In the definitions
in this section we have been rather careless about the convergence of sums
and integrals. Let us take a closer look at the integral $I = \int_{-\infty}^{\infty} x f(x)\,dx$.
Since a probability density function cannot take negative values, we have
$I = I^- + I^+$ with $I^- = \int_{-\infty}^{0} x f(x)\,dx$ a negative and
$I^+ = \int_{0}^{\infty} x f(x)\,dx$ a positive number. However, it may happen that
$I^-$ equals $-\infty$ or $I^+$ equals $+\infty$. If both $I^- = -\infty$ and
$I^+ = +\infty$, then we say that the expected value does not exist. An example of a
continuous random variable for which the expected value does not exist is the
random variable with the Cauchy distribution (see also page 161), having
probability density function

$$f(x) = \frac{1}{\pi(1 + x^2)} \quad \text{for } -\infty < x < \infty.$$

For this random variable

$$I^+ = \int_0^{\infty} \frac{x}{\pi(1 + x^2)}\,dx = \left[\frac{1}{2\pi} \ln(1 + x^2)\right]_0^{\infty} = +\infty,$$

$$I^- = \int_{-\infty}^{0} \frac{x}{\pi(1 + x^2)}\,dx = \left[\frac{1}{2\pi} \ln(1 + x^2)\right]_{-\infty}^{0} = -\infty.$$

If $I^-$ is finite but $I^+ = +\infty$, then we say that the expected value is infinite.
A distribution that has an infinite expectation is the Pareto distribution
with parameter $\alpha = 1$ (see Exercise 7.11). The remarks we made on the
integral in the definition of E[X] for continuous X apply similarly to the
sum in the definition of E[X] for discrete random variables X.

7.2 Three examples

The  geometric  distribution 

If  you  buy  a  lottery  ticket  every  week  and  you  have  a  chance  of  1  in  10  000 
of  winning  the  jackpot,  what  is  the  expected  number  of  weeks  you  have  to 
buy  tickets  before  you  get  the  jackpot?  The  answer  is:  10  000  weeks  (almost 
two  centuries!).  The  number  of  weeks  is  modeled  by  a  random  variable  with 
a geometric distribution with parameter $p = 10^{-4}$.


The expectation of a geometric distribution. Let X have
a geometric distribution with parameter p; then

$$E[X] = \sum_{k=1}^{\infty} k\,p\,(1 - p)^{k-1} = \frac{1}{p}.$$

Here $\sum_{k=1}^{\infty} k\,p\,(1 - p)^{k-1} = 1/p$ follows from the formula
$\sum_{k=1}^{\infty} k x^{k-1} = 1/(1 - x)^2$ that has been derived in your calculus
course. We will see a simple (probabilistic) way to obtain the value of this sum
in Chapter 11.
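
A quick numerical check of this sum is sketched below: the partial sums of $\sum_k k\,p\,(1-p)^{k-1}$ are computed for an assumed illustrative value p = 0.2 (so that 1/p = 5). The function name `geometric_mean_partial` is ours.

```python
def geometric_mean_partial(p, terms):
    """Partial sum of sum_{k>=1} k * p * (1 - p)**(k - 1), truncated after `terms` terms."""
    return sum(k * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))

p = 0.2  # assumed illustrative success probability
for terms in (5, 20, 200):
    print(terms, geometric_mean_partial(p, terms))
print("1/p =", 1 / p)
```

The partial sums increase toward 1/p; for the lottery example with $p = 10^{-4}$ far more terms would be needed before the truncation error becomes negligible.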

The  exponential  distribution 

In  Section  5.6  we  considered  the  chemical  reactor  example,  where  the  residence 
time T, measured in minutes, has an Exp(0.5) distribution. We claimed that
this  implies  that  the  mean  time  a  particle  stays  in  the  vessel  is  2  minutes. 
More  generally,  we  have  the  following. 


The expectation of an exponential distribution. Let X
have an exponential distribution with parameter $\lambda$; then

$$E[X] = \int_0^{\infty} x\,\lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}.$$


The  integral  has  been  determined  in  your  calculus  course  (with  the  technique 
of  integration  by  parts). 
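
For readers who prefer to see the value of this integral confirmed numerically, here is a small sketch using a midpoint rule on a truncated interval. The truncation point `upper` and the number of steps are assumptions chosen for illustration; the function name is ours.

```python
import math

def exponential_mean_numeric(lam, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of x * lam * exp(-lam * x) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * lam * math.exp(-lam * x)
    return total * h

print(exponential_mean_numeric(0.5))  # close to 1 / 0.5 = 2
```

With parameter 0.5, as in the chemical reactor example, the result is close to the claimed mean residence time of 2 minutes.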

The  normal  distribution 

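Here, using that the normal density integrates to 1 and applying the
substitution $z = (x - \mu)/\sigma$,

$$\int_{-\infty}^{\infty} x\,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx
= \mu + \int_{-\infty}^{\infty} (x - \mu)\,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx
= \mu + \sigma \int_{-\infty}^{\infty} z\,\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz = \mu,$$

where the integral is 0, because the integrand is an odd function. We obtained
the following rule.


The expectation of a normal distribution. Let X be an
$N(\mu, \sigma^2)$ distributed random variable. Then

$$E[X] = \int_{-\infty}^{\infty} x\,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \mu.$$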


7.3 The change-of-variable formula


Often one does not want to compute the expected value of a random variable
X but rather of a function of X, as, for example, $X^2$. We then need to
determine the distribution of $Y = X^2$, for example by computing the distribution
function $F_Y$ of Y (this is an example of the general problem of how
distributions change under transformations; this topic is the subject of Chapter 8).
For a concrete example, suppose an architect wants maximal variety in the
sizes of buildings: these should be of the same width and depth X, but X is
uniformly distributed between 0 and 10 meters. What is the distribution of
the area $X^2$ of a building; in particular, will this distribution be (anything
near to) uniform? Let us compute $F_Y$; for $0 \le a \le 100$:

$$F_Y(a) = P(X^2 \le a) = P(X \le \sqrt{a}) = \frac{\sqrt{a}}{10}.$$

Hence the probability density function $f_Y$ of the area is, for $0 < y < 100$
meters squared, given by

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} \frac{\sqrt{y}}{10} = \frac{1}{20\sqrt{y}}. \qquad (7.1)$$


This means that the buildings with small areas are heavily overrepresented,
because $f_Y$ explodes near 0; see also Figure 7.3, in which we plotted $f_Y$.
Surprisingly, this is not very visible in Figure 7.4, an example where we should
believe our calculations more than our eyes. In the figure the locations of
the buildings are generated by a Poisson process, the subject of Chapter 12.
Suppose that a contractor has to make an offer on the price of the foundations
of the buildings. The amount of concrete he needs will be proportional to the
area $X^2$ of a building. So his problem is: what is the expected area of a
building? With $f_Y$ from (7.1) he finds

$$E[X^2] = E[Y] = \int_0^{100} y \cdot \frac{1}{20\sqrt{y}}\,dy = \int_0^{100} \frac{\sqrt{y}}{20}\,dy = \left[\frac{1}{20} \cdot \frac{2}{3}\, y^{3/2}\right]_0^{100} = 33\tfrac{1}{3}\ \text{m}^2.$$

Fig. 7.3. The probability density of the square of a U(0, 10) random variable.


It is interesting to note that we really need to do this calculation, because
the expected area is not simply the product of the expected width and the
expected depth, which is 25 m$^2$. However, there is a much easier way in which
the contractor could have obtained this result. He could have argued that
the value of the area is $x^2$ when x is the width, and that he should take the
weighted average of those values, where the weight at width x is given by the
value $f_X(x)$ of the probability density of X. Then he would have computed

$$\int_{-\infty}^{\infty} x^2 f_X(x)\,dx = \int_0^{10} x^2 \cdot \frac{1}{10}\,dx = \left[\frac{x^3}{30}\right]_0^{10} = 33\tfrac{1}{3}\ \text{m}^2.$$

It  is  indeed  a  mathematical  theorem  that  this  is  always  a  correct  way  to 
compute  expected  values  of  functions  of  random  variables. 
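
The agreement of the two routes for the building example can also be checked numerically. The sketch below uses a simple midpoint rule (our own helper `midpoint_integral`, with an assumed number of steps) to integrate $y f_Y(y)$ over (0, 100) and $x^2 f_X(x)$ over (0, 10).

```python
import math

def midpoint_integral(func, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (i + 0.5) * h) for i in range(steps))

# Route 1: integrate y * f_Y(y), with f_Y(y) = 1 / (20 * sqrt(y)) from (7.1).
via_density_of_area = midpoint_integral(lambda y: y / (20 * math.sqrt(y)), 0, 100)

# Route 2: integrate x**2 * f_X(x), with f_X(x) = 1/10 on [0, 10].
via_change_of_variable = midpoint_integral(lambda x: x**2 / 10, 0, 10)

print(via_density_of_area, via_change_of_variable)  # both close to 100/3
```

The second route never needs the density of the area, which is the practical appeal of the formula stated next.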


Fig. 7.4. Top: widths of the buildings between 0 and 10 meters. Bottom:
corresponding buildings in a 100×300 m area.

The change-of-variable formula. Let X be a random variable,
and let $g : \mathbb{R} \to \mathbb{R}$ be a function.

If X is discrete, taking the values $a_1, a_2, \ldots$, then

$$E[g(X)] = \sum_i g(a_i)\, P(X = a_i).$$

If X is continuous, with probability density function f, then

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$


Quick exercise 7.3 Let X have a Ber(p) distribution. Compute $E[2^X]$.

An  operation  that  occurs  very  often  in  practice  is  a  change  of  units,  e.g.,  from 
Fahrenheit  to  Celsius.  What  happens  then  to  the  expectation?  Here  we  have 
to  apply  the  formula  with  the  function  g(x)  =  rx  +  s,  where  r  and  s  are 
real  numbers.  When  X  has  a  continuous  distribution,  the  change-of-variable 
formula  yields: 


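$$E[rX + s] = \int_{-\infty}^{\infty} (rx + s) f(x)\,dx = r \int_{-\infty}^{\infty} x f(x)\,dx + s \int_{-\infty}^{\infty} f(x)\,dx = rE[X] + s.$$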


A  similar  computation  with  integrals  replaced  by  sums  gives  the  same  result 
for  discrete  random  variables. 


7.4  Variance 

Suppose  you  are  offered  an  opportunity  for  an  investment  whose  expected 
return  is  €500.  If  you  are  given  the  extra  information  that  this  expected 
value  is  the  result  of  a  50%  chance  of  a  €450  return  and  a  50%  chance  of  a 
€550  return,  then  you  would  not  hesitate  to  spend  €450  on  this  investment. 
However, if the expected return were the result of a 50% chance of a €0 return
and a 50% chance of a €1000 return, then most people would be reluctant to
spend  such  an  amount.  This  demonstrates  that  the  spread  (around  the  mean) 
of  a  random  variable  is  of  great  importance.  Usually  this  is  measured  by  the 
expected  squared  deviation  from  the  mean. 

Definition. The variance Var(X) of a random variable X is the
number

$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right].$$

Note that the variance of a random variable is always positive (or 0).
Furthermore, there is the question of existence and finiteness (cf. Remark 7.1).
In practical situations one often considers the standard deviation defined by
$\sqrt{\mathrm{Var}(X)}$, because it has the same dimension as E[X].

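As an example, let us compute the variance of a normal distribution. If X has
an $N(\mu, \sigma^2)$ distribution, then:

$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]
= \int_{-\infty}^{\infty} (x - \mu)^2\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx
= \sigma^2 \int_{-\infty}^{\infty} z^2\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz.$$

Here we substituted $z = (x - \mu)/\sigma$. Using integration by parts one finds that

$$\int_{-\infty}^{\infty} z^2\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz = 1.$$

We have found the following property.


Variance of a normal distribution. Let X be an $N(\mu, \sigma^2)$
distributed random variable. Then

$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \sigma^2.$$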

Quick exercise 7.4 Let us call the two returns discussed above $Y_1$ and $Y_2$,
respectively. Compute the variance and standard deviation of $Y_1$ and $Y_2$.

It is often not practical to compute Var(X) directly from the definition, but
one uses the following rule.


An alternative expression for the variance. For any random
variable X,

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2.$$


To see that this rule holds, we apply the change-of-variable formula. Suppose
X is a continuous random variable with probability density function f
(the discrete case runs completely analogously). Using the change-of-variable
formula, well-known properties of the integral, and $\int_{-\infty}^{\infty} f(x)\,dx = 1$, we find

$$\begin{aligned}
\mathrm{Var}(X) &= E\left[(X - E[X])^2\right] \\
&= \int_{-\infty}^{\infty} (x - E[X])^2 f(x)\,dx \\
&= \int_{-\infty}^{\infty} \left(x^2 - 2xE[X] + (E[X])^2\right) f(x)\,dx \\
&= \int_{-\infty}^{\infty} x^2 f(x)\,dx - 2E[X] \int_{-\infty}^{\infty} x f(x)\,dx + (E[X])^2 \int_{-\infty}^{\infty} f(x)\,dx \\
&= E[X^2] - 2(E[X])^2 + (E[X])^2 \\
&= E[X^2] - (E[X])^2.
\end{aligned}$$


With this rule we make two steps: first we compute E[X], then we compute
$E[X^2]$. The latter is called the second moment of X. Let us compare the
computations, using the definition and this rule for the drill bit example.
Recall that for this example X takes the values 2, 3, and 4 with probabilities
0.1, 0.7, and 0.2. We found that E[X] = 3.1. According to the definition

$$\begin{aligned}
\mathrm{Var}(X) = E\left[(X - 3.1)^2\right] &= 0.1 \cdot (2 - 3.1)^2 + 0.7 \cdot (3 - 3.1)^2 + 0.2 \cdot (4 - 3.1)^2 \\
&= 0.1 \cdot (-1.1)^2 + 0.7 \cdot (-0.1)^2 + 0.2 \cdot (0.9)^2 \\
&= 0.1 \cdot 1.21 + 0.7 \cdot 0.01 + 0.2 \cdot 0.81 \\
&= 0.121 + 0.007 + 0.162 \\
&= 0.29.
\end{aligned}$$

Using the rule is neater and somewhat faster:

$$\begin{aligned}
\mathrm{Var}(X) = E[X^2] - (3.1)^2 &= 0.1 \cdot 2^2 + 0.7 \cdot 3^2 + 0.2 \cdot 4^2 - 9.61 \\
&= 0.1 \cdot 4 + 0.7 \cdot 9 + 0.2 \cdot 16 - 9.61 \\
&= 0.4 + 6.3 + 3.2 - 9.61 \\
&= 0.29.
\end{aligned}$$
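
Both computations for the drill bit example can be reproduced with a few lines of code. This is only a sketch: the helper `expect` (our name) implements the discrete change-of-variable formula, and exact fractions are used so that the two results can be compared without rounding.

```python
from fractions import Fraction

# Drill bit example: values 2, 3, 4 with probabilities 0.1, 0.7, 0.2.
pmf = {2: Fraction(1, 10), 3: Fraction(7, 10), 4: Fraction(2, 10)}

def expect(g, pmf):
    """Change-of-variable formula: E[g(X)] = sum of g(a_i) * P(X = a_i)."""
    return sum(g(a) * p for a, p in pmf.items())

mean = expect(lambda a: a, pmf)
var_by_definition = expect(lambda a: (a - mean) ** 2, pmf)
var_by_rule = expect(lambda a: a**2, pmf) - mean**2

print(mean, var_by_definition, var_by_rule)  # 31/10 29/100 29/100
```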


What happens to the variance if we change units? At the end of the previous
section we showed that E[rX + s] = rE[X] + s. This can be used to
obtain the corresponding rule for the variance under change of units (see also
Exercise 7.15).


Expectation and variance under change of units. For any
random variable X and any real numbers r and s,

$$E[rX + s] = rE[X] + s, \quad \text{and} \quad \mathrm{Var}(rX + s) = r^2\,\mathrm{Var}(X).$$

Note  that  the  variance  is  insensitive  to  the  shift  over  s.  Can  you  understand 
why  this  must  be  true  without  doing  any  computations? 
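
The change-of-units rule can be verified exactly on a small discrete example. The sketch below reuses the drill bit distribution and an illustrative choice of r and s (our assumption, picked to resemble a Celsius-to-Fahrenheit conversion); the function names are ours.

```python
from fractions import Fraction

def mean_and_variance(pmf):
    """Return (E[X], Var(X)) for a pmf given as {value: probability}."""
    mean = sum(a * p for a, p in pmf.items())
    variance = sum((a - mean) ** 2 * p for a, p in pmf.items())
    return mean, variance

def change_units(pmf, r, s):
    """Return the pmf of rX + s (assumes r != 0, so distinct values stay distinct)."""
    return {r * a + s: p for a, p in pmf.items()}

pmf = {2: Fraction(1, 10), 3: Fraction(7, 10), 4: Fraction(2, 10)}  # drill bit example
r, s = Fraction(9, 5), 32  # illustrative choice of r and s

mean, variance = mean_and_variance(pmf)
new_mean, new_variance = mean_and_variance(change_units(pmf, r, s))
print(new_mean == r * mean + s, new_variance == r**2 * variance)  # True True
```

Changing s alone leaves `new_variance` untouched, since every value and the mean shift by the same amount.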

7.5 Solutions to the quick exercises


7.1 We have

$$E[X] = \sum_i a_i P(X = a_i) = 1 \cdot \tfrac{1}{5} + 2 \cdot \tfrac{1}{5} + 4 \cdot \tfrac{1}{5} + 8 \cdot \tfrac{1}{5} + 16 \cdot \tfrac{1}{5} = \tfrac{31}{5} = 6.2.$$


7.2 The probability density function f of U is given by f(x) = 0 outside [2, 5]
and f(x) = 1/3 for $2 \le x \le 5$; hence

$$E[U] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_2^5 \tfrac{1}{3} x\,dx = \left[\tfrac{1}{6} x^2\right]_2^5 = 3\tfrac{1}{2}.$$


7.3 Using the change-of-variable formula we obtain

$$\begin{aligned}
E[2^X] &= \sum_i 2^{a_i} P(X = a_i) \\
&= 2^0 \cdot P(X = 0) + 2^1 \cdot P(X = 1) \\
&= 1 \cdot (1 - p) + 2 \cdot p = 1 - p + 2p = 1 + p.
\end{aligned}$$

You could also have noted that $Y = 2^X$ has a distribution given by P(Y = 1) =
1 − p, P(Y = 2) = p; hence

$$E[2^X] = E[Y] = 1 \cdot P(Y = 1) + 2 \cdot P(Y = 2) = 1 \cdot (1 - p) + 2 \cdot p = 1 + p.$$

7.4 We have

$$\mathrm{Var}(Y_1) = \tfrac{1}{2}(450 - 500)^2 + \tfrac{1}{2}(550 - 500)^2 = 50^2 = 2500,$$

so $Y_1$ has standard deviation €50 and

$$\mathrm{Var}(Y_2) = \tfrac{1}{2}(0 - 500)^2 + \tfrac{1}{2}(1000 - 500)^2 = 500^2 = 250\,000,$$

so $Y_2$ has standard deviation €500.


7.6  Exercises 

7.1  □  Let  T  be  the  outcome  of  a  roll  with  a  fair  die. 

a.  Describe  the  probability  distribution  of  T,  that  is,  list  the  outcomes  and 
the  corresponding  probabilities. 

b. Determine E[T] and Var(T).

7.2 □ The probability distribution of a discrete random variable X is given
by

$$P(X = -1) = \tfrac{1}{5}, \quad P(X = 0) = \tfrac{2}{5}, \quad P(X = 1) = \tfrac{2}{5}.$$

a. Compute E[X].

b. Give the probability distribution of $Y = X^2$ and compute E[Y] using the
distribution of Y.

c. Determine $E[X^2]$ using the change-of-variable formula. Check your
answer against the answer in b.

d. Determine Var(X).

7.3 For a certain random variable X it is known that E[X] = 2, Var(X) = 3.
What is $E[X^2]$?

7.4  Let  X  be  a  random  variable  with  E[X]  =  2,  Var(X)  =  4.  Compute  the 
expectation  and  variance  of  3  —  2X. 

7.5  □  Determine  the  expectation  and  variance  of  the  Ber(p)  distribution. 

7.6 ⊞ The random variable Z has probability density function $f(z) = 3z^2/19$
for $2 \le z \le 3$ and f(z) = 0 elsewhere. Determine E[Z]. Before you do the
calculation: will the answer lie closer to 2 than to 3 or the other way around?

7.7 Given is a random variable X with probability density function f given
by f(x) = 0 for x < 0, and for x > 1, and $f(x) = 4x - 4x^3$ for $0 \le x \le 1$.
Determine the expectation and variance of the random variable 2X + 3.

7.8 □ Given is a continuous random variable X whose distribution function
F satisfies F(x) = 0 for x < 0, F(x) = 1 for x > 1, and F(x) = x(2 − x) for
$0 \le x \le 1$. Determine E[X].

7.9 Let U be a random variable with a $U(\alpha, \beta)$ distribution.

a.  Determine  the  expectation  of  U. 

b.  Determine  the  variance  of  U. 

7.10 □ Let X have an exponential distribution with parameter $\lambda$.

a. Determine E[X] and $E[X^2]$ using partial integration.

b.  Determine  Var(X). 

7.11  □  In  this  exercise  we  take  a  look  at  the  mean  of  a  Pareto  distribution. 

a. Determine the expectation of a Par(2) distribution.

b. Determine the expectation of a Par(1/2) distribution.

c. Let X have a Par($\alpha$) distribution. Show that $E[X] = \alpha/(\alpha - 1)$ if $\alpha > 1$.

7.12 For which $\alpha$ is the variance of a Par($\alpha$) distribution finite? Compute the
variance for these $\alpha$.

7.13 Remember that we found on page 95 that the expected area of a building
was $33\tfrac{1}{3}$ m$^2$, whereas the square of the expected width was only 25 m$^2$. This
phenomenon is more general: show that for any random variable X one has

$$E[X^2] \ge (E[X])^2.$$

Hint: you might use that Var(X) ≥ 0.

7.14 Suppose we choose arbitrarily a point from the square with corners at
(2,1), (3,1), (2,2), and (3,2). The random variable A is the area of the triangle
with its corners at (2,1), (3,1), and the chosen point. (See also Exercise 5.9
and Figure 7.5.) Compute E[A].


Fig. 7.5. A triangle in a 1×1 square.

7.15 ⊞ Let X be a random variable and r and s any real numbers. Use the
change-of-units rule E[rX + s] = rE[X] + s for the expectation to obtain a
and b.

a. Show that $\mathrm{Var}(rX) = r^2\,\mathrm{Var}(X)$.

b. Show that Var(X + s) = Var(X).

c. Combine parts a and b to show that

$$\mathrm{Var}(rX + s) = r^2\,\mathrm{Var}(X).$$

7.16 □ The probability density function f of the random variable X used
in Figure 7.2 is given by f(x) = 0 outside (0,1) and $f(x) = -4x \ln(x)$ for
0 < x < 1. Compute the position of the balancing point in the figure, that is,
compute the expectation of X.

7.17 ⊞ Let U be a discrete random variable taking the values $a_1, \ldots, a_r$ with
probabilities $p_1, \ldots, p_r$.

a. Suppose all $a_i \ge 0$, but that E[U] = 0. Show then

$$a_1 = a_2 = \cdots = a_r = 0.$$

In other words: P(U = 0) = 1.

b. Suppose that V is a random variable taking the values $b_1, \ldots, b_r$ with
probabilities $p_1, \ldots, p_r$. Show that Var(V) = 0 implies

$$P(V = E[V]) = 1.$$

Hint: apply a with $U = (V - E[V])^2$.


Computations with random variables

There  are  many  ways  to  make  new  random  variables  from  old  ones.  Of  course 
this  is  not  a  goal  in  itself;  usually  new  variables  are  created  naturally  in 
the  process  of  solving  a  practical  problem.  The  expectations  and  variances 
of  such  new  random  variables  can  be  calculated  with  the  change-of-variable 
formula.  However,  often  one  would  like  to  know  the  distributions  of  the  new 
random  variables.  We  shall  show  how  to  determine  these  distributions,  how 
to  compare  expectations  of  random  variables  and  their  transformed  versions 
(Jensen’s  inequality),  and  how  to  determine  the  distributions  of  maxima  and 
minima  of  several  random  variables. 


8.1  Transforming  discrete  random  variables 

The problem we consider in this section and the next is how the distribution
of a random variable X changes if we apply a function g to it, thus obtaining
a new random variable Y:

$$Y = g(X).$$

When X is a discrete random variable this is usually not too hard to do: it
is just a matter of bookkeeping. We illustrate this with an example. Imagine
an airline company that sells tickets for a flight with 150 available seats. It
has no idea about how many tickets it will sell. Suppose, to keep the example
simple, that the number X of tickets that will be sold can be anything from 1
to 200. Moreover, suppose that each possibility has equal probability to occur,
i.e., P(X = j) = 1/200 for j = 1, 2, ..., 200. The real interest of the airline
company is in the random variable Y, which is the number of passengers that
have to be refused. What is the distribution of Y? To answer this, note that
nobody will be refused when the passengers fit in the plane, hence

$$P(Y = 0) = P(X \le 150) = \frac{150}{200} = \frac{3}{4}.$$

For the other values, k = 1, 2, ..., 50,

$$P(Y = k) = P(X = 150 + k) = \frac{1}{200}.$$

Note that in this example the function g is given by $g(x) = \max\{x - 150, 0\}$.
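
The bookkeeping for the airline example can be sketched in code: for each value x of X we add its probability to the entry for g(x). The helper name `transform_pmf` and the dictionary representation are our own choices for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def transform_pmf(pmf, g):
    """Distribution of Y = g(X): add up P(X = x) over all x with g(x) = y."""
    result = defaultdict(Fraction)
    for x, prob in pmf.items():
        result[g(x)] += prob
    return dict(result)

tickets = {j: Fraction(1, 200) for j in range(1, 201)}   # P(X = j) = 1/200
refused = transform_pmf(tickets, lambda x: max(x - 150, 0))

print(refused[0])    # 3/4
print(refused[50])   # 1/200
```

The same helper works for any function g on a discrete random variable; only the lambda changes.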

Quick  exercise  8.1  Let  Z  be  the  number  of  passengers  that  will  be  in  the 
plane.  Determine  the  probability  distribution  of  Z .  What  is  the  function  g  in 
this  case? 


8.2 Transforming continuous random variables


We  now  turn  to  continuous  random  variables.  Since  single  values  occur  with 
probability  zero  for  a  continuous  random  variable,  the  approach  above  does 
not  work.  The  strategy  now  is  to  first  determine  the  distribution  function  of 
the  transformed  random  variable  Y  =  g{X)  and  then  the  probability  density 
by  differentiating.  We  shall  illustrate  this  with  the  following  example  (actually 
we  saw  an  example  of  such  a  computation  in  Section  7.3  with  the  function 
$g(x) = x^2$).

We  consider  two  methods  that  traffic  police  employ  to  determine  whether 
you  deserve  a  fine  for  speeding.  From  experience,  the  traffic  police  think  that 
vehicles  are  driving  at  speeds  ranging  from  60  to  90  km/hour  at  a  certain 
road  section  where  the  speed  limit  is  80  km/hour.  They  assume  that  the 
speed  of  the  cars  is  uniformly  distributed  over  this  interval.  The  first  method 
is  measuring  the  speed  at  a  fixed  spot  in  the  road  section.  With  this  method 
the police will find that about (90 − 80)/(90 − 60) = 1/3 of the cars will be
fined. 


For  the  second  method,  cameras  are  put  at  the  beginning  and  end  of  a  1-km 
road  section,  and  a  driver  is  fined  if  he  spends  less  than  a  certain  amount  of 
time  in  the  road  section.  Cars  driving  at  60  km/hour  need  one  minute,  those 
driving  at  90  km/hour  only  40  seconds.  Let  us  therefore  model  the  time  T 
an  arbitrary  car  spends  in  the  section  by  a  uniform  distribution  over  (40,60) 
seconds. What is the speed V we deduce from this travelling time? Note that
for $40 \le t \le 60$,

$$P(T \le t) = \frac{t - 40}{20}.$$

Since there are 3600 seconds in an hour we have that

$$V = g(T) = \frac{3600}{T}.$$

We therefore find for the distribution function $F_V(v) = P(V \le v)$ of the
speed V that

$$F_V(v) = P\!\left(\frac{3600}{T} \le v\right) = P\!\left(T \ge \frac{3600}{v}\right) = 1 - \frac{(3600/v) - 40}{20} = 3 - \frac{180}{v}$$

for all speeds v between 60 and 90. We can now obtain the probability density
$f_V$ of V by differentiating:

$$f_V(v) = \frac{d}{dv}\left(3 - \frac{180}{v}\right) = \frac{180}{v^2} \quad \text{for } 60 \le v \le 90.$$

It is amusing to note that with the second model the traffic police write fewer
speeding tickets because

$$P(V > 80) = 1 - P(V \le 80) = 1 - \left(3 - \frac{180}{80}\right) = \frac{1}{4}.$$

(With the first model we found probability 1/3 that a car drove faster than
80 km/hour.) This is related to a famous result in road traffic research, which
is succinctly phrased as: “space mean speed < time mean speed” (see [37]).
It is also related to Jensen’s inequality, which we introduce in Section 8.3.
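
The two fractions of fined drivers can be recomputed from the two models. The sketch below uses the distribution function $F_V(v) = 3 - 180/v$ derived above for the second model; the function and variable names are ours.

```python
from fractions import Fraction

def speed_cdf_from_travel_time(v):
    """F_V(v) = 3 - 180/v for 60 <= v <= 90, when T is uniform on (40, 60) seconds."""
    return 3 - Fraction(180, 1) / v

# Model 1: speed uniform on (60, 90) km/hour.
fined_fixed_spot = Fraction(90 - 80, 90 - 60)

# Model 2: travel time uniform on (40, 60) s over 1 km, speed V = 3600/T.
fined_section = 1 - speed_cdf_from_travel_time(80)

print(fined_fixed_spot, fined_section)  # 1/3 1/4
```

Equivalently, in the second model a driver is fined when T < 3600/80 = 45 seconds, which has probability (45 − 40)/20 = 1/4.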

Similar to the way this is done in the traffic example, one can determine
the distribution of Y = 1/X for any X with a continuous distribution. The
outcome will be that if X has density $f_X$, then the density $f_Y$ of Y is given
by

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{1}{y^2}\, f_X\!\left(\frac{1}{y}\right) \quad \text{for } y < 0 \text{ and } y > 0.$$

One can give $f_Y(0)$ any value; often one puts $f_Y(0) = 0$.

Quick exercise 8.2 Let X have a continuous distribution with probability
density $f_X(x) = 1/[\pi(1 + x^2)]$. What is the distribution of Y = 1/X?

We turn to a second example. A very common transformation is a change of
units, for instance, from Celsius to Fahrenheit. If X is temperature expressed
in degrees Celsius, then $Y = \tfrac{9}{5}X + 32$ is the temperature in degrees Fahrenheit.
Let $F_X$ and $F_Y$ be the distribution functions of X and Y. Then we have for
any a

$$F_Y(a) = P(Y \le a) = P\!\left(\tfrac{9}{5}X + 32 \le a\right) = P\!\left(X \le \tfrac{5}{9}(a - 32)\right) = F_X\!\left(\tfrac{5}{9}(a - 32)\right).$$

By differentiating $F_Y$ (using the chain rule), we obtain the probability density
$f_Y(y) = \tfrac{5}{9} f_X\!\left(\tfrac{5}{9}(y - 32)\right)$. We can do this for more general changes of units,
and we obtain the following useful rule.

Change-of-units transformation. Let X be a continuous random
variable with distribution function $F_X$ and probability density
function $f_X$. If we change units to Y = rX + s for real numbers r > 0
and s, then

$$F_Y(y) = F_X\!\left(\frac{y - s}{r}\right) \quad \text{and} \quad f_Y(y) = \frac{1}{r}\, f_X\!\left(\frac{y - s}{r}\right).$$


As an example, let X be a random variable with an $N(\mu, \sigma^2)$ distribution,
and let Y = rX + s. Then this rule gives us

$$f_Y(y) = \frac{1}{r}\, f_X\!\left(\frac{y - s}{r}\right) = \frac{1}{r\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y - r\mu - s}{r\sigma}\right)^2}$$

for $-\infty < y < \infty$. On the right-hand side we recognize the probability density
of a normal distribution with parameters $r\mu + s$ and $r^2\sigma^2$. This illustrates the
following rule.


Normal random variables under change of units. Let X
be a random variable with an $N(\mu, \sigma^2)$ distribution. For any $r \ne 0$
and any s, the random variable rX + s has an $N(r\mu + s, r^2\sigma^2)$
distribution.


Note that if X has an $N(\mu, \sigma^2)$ distribution, then with $r = 1/\sigma$ and $s = -\mu/\sigma$
we conclude that

$$Z = \frac{1}{\sigma} X + \left(-\frac{\mu}{\sigma}\right) = \frac{X - \mu}{\sigma}$$

has an N(0, 1) distribution. As a consequence

$$F_X(a) = P(X \le a) = P(\sigma Z + \mu \le a) = P\!\left(Z \le \frac{a - \mu}{\sigma}\right) = \Phi\!\left(\frac{a - \mu}{\sigma}\right).$$

So any probability for an $N(\mu, \sigma^2)$ distributed random variable X can be
expressed in terms of an N(0, 1) distributed random variable Z.

Quick exercise 8.3 Compute the probabilities P(X ≤ 5) and P(X ≥ 2) for
X with an N(4, 25) distribution.
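
In code, the reduction to the standard normal distribution takes one line once $\Phi$ is available. The sketch below writes $\Phi$ in terms of the error function from Python's `math` module; the parameters in the usage line are hypothetical and chosen only for illustration (they are not those of the quick exercise).

```python
import math

def standard_normal_cdf(z):
    """Phi(z), written with the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(a, mu, sigma):
    """P(X <= a) for X ~ N(mu, sigma^2), via Z = (X - mu) / sigma."""
    return standard_normal_cdf((a - mu) / sigma)

# Hypothetical parameters, chosen only for illustration: X ~ N(10, 4), so sigma = 2.
print(normal_cdf(12.0, mu=10.0, sigma=2.0))  # Phi(1), roughly 0.8413
```

Note that `sigma` is the standard deviation, that is, the square root of the second parameter of $N(\mu, \sigma^2)$.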


8.3 Jensen’s inequality

Without actually computing the distribution of g(X) we can often tell how
E[g(X)] relates to g(E[X]). For the change-of-units transformation g(x) =
rx + s we know that E[g(X)] = g(E[X]) (see Section 7.3). It is a common
error to equate these two sides for other functions g. In fact, equality will very
rarely occur for nonlinear g.

For  example,  suppose  that  a  company  that  produces  microelectronic  parts 
has  a  target  production  of  240  chips  per  day,  but  the  yield  has  only  been  40, 
60,  and  80  chips  on  three  consecutive  days.  The  average  production  over  the 
three  days  then  is  60  chips,  so  on  average  the  production  should  have  been 
4  times  higher  to  reach  the  target.  However,  one  can  also  look  at  this  in  the 
following  way:  on  the  three  days  the  production  should  have  been  240/40  =  6, 
240/60  =  4,  and  240/80  =  3  times  higher.  On  average  that  is 

$$\tfrac{1}{3}(6 + 4 + 3) = \tfrac{13}{3} = 4.3333\ldots$$

times higher! What happens here can be explained (take for X the part of the
target production that is realized, where you give equal probabilities to the
three outcomes 1/6, 1/4, and 1/3) by the fact that if X is a random variable
taking positive values, then always

$$\frac{1}{E[X]} < E\!\left[\frac{1}{X}\right],$$

unless Var(X) = 0, which only happens if X is not random at all (cf.
Exercise 7.17). This inequality is the case g(x) = 1/x on $(0, \infty)$ of the following
result that holds for general convex functions g.


Jensen’s inequality. Let g be a convex function, and let X be
a random variable. Then

$$g(E[X]) \le E[g(X)].$$


Recall from calculus that a twice differentiable function g is convex on an
interval I if $g''(x) \ge 0$ for all x in I, and strictly convex if $g''(x) > 0$ for
all x in I. When X takes its values in an interval I (this can, for instance,
be $I = (-\infty, \infty)$), and g is strictly convex on I, then strict inequality holds:
g(E[X]) < E[g(X)], unless X is not random.

In Figure 8.1 we illustrate the way in which this result can be obtained for
the special case of a random variable X that takes two values, a and b. In the
figure, X takes these two values with probability 3/4 and 1/4 respectively.
Convexity of g forces any line segment connecting two points on the graph of
g to lie above the part of the graph between these two points. So if we choose
the line segment from (a, g(a)) to (b, g(b)), then it follows that the point

$$(E[X], E[g(X)]) = \left(\tfrac{3}{4}a + \tfrac{1}{4}b,\ \tfrac{3}{4}g(a) + \tfrac{1}{4}g(b)\right) = \tfrac{3}{4}(a, g(a)) + \tfrac{1}{4}(b, g(b))$$

on this line lies “above” the point (E[X], g(E[X])) on the graph of g. Hence
E[g(X)] ≥ g(E[X]).

Fig. 8.1. Jensen’s inequality.


A simple example is given by $g(x) = x^2$. This function is convex ($g''(x) = 2$
for all x), and hence

$$(E[X])^2 \le E[X^2].$$

Note that this is exactly the same as saying that Var(X) ≥ 0, which we have
already seen in Section 7.4.
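
Both instances of Jensen’s inequality met so far can be checked exactly on the chip example, where X takes the values 1/6, 1/4, and 1/3 with equal probabilities. The helper `expect` is our own name; this is an illustrative sketch only.

```python
from fractions import Fraction

def expect(g, pmf):
    """E[g(X)] for a discrete pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

# Chip example: realized fraction of the target, each outcome with probability 1/3.
pmf = {Fraction(1, 6): Fraction(1, 3), Fraction(1, 4): Fraction(1, 3), Fraction(1, 3): Fraction(1, 3)}
mean = expect(lambda x: x, pmf)

print(1 / mean, expect(lambda x: 1 / x, pmf))        # 4 13/3
print(mean**2, expect(lambda x: x**2, pmf))          # 1/16 29/432
```

In both lines the first number, g(E[X]), is strictly smaller than the second, E[g(X)], as expected for a strictly convex g and a genuinely random X.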

Quick exercise 8.4 Let X be a random variable with Var(X) > 0. Which
is true: $E[e^{-X}] < e^{-E[X]}$ or $E[e^{-X}] > e^{-E[X]}$?


8.4  Extremes 

In many situations the maximum (or minimum) of a sequence $X_1, X_2, \ldots, X_n$
of random variables is the variable of interest. For instance, let $X_1, X_2,
\ldots, X_{365}$ be the water level of a river during the days of a particular year
for a particular location. Suppose there will be flooding if the level exceeds a
certain height (usually the height of the dykes). The question whether flooding
occurs during a year is completely answered by looking at the maximum
of $X_1, X_2, \ldots, X_{365}$. If one wants to predict occurrence of flooding in the
future, the probability distribution of this maximum is of great interest. Similar
models arise, for instance, when one is interested in possible damage from a
series of shocks or in the extent of a contamination plume in the subsurface.

We want to find the distribution of the random variable

$$Z = \max\{X_1, X_2, \ldots, X_n\}.$$

We can determine the distribution function of $Z$ by realizing that the maximum
of the $X_i$ is smaller than or equal to a number $a$ if and only if all $X_i$ are
smaller than or equal to $a$:

$$F_Z(a) = \mathrm{P}(Z \le a) = \mathrm{P}(\max\{X_1, \ldots, X_n\} \le a) = \mathrm{P}(X_1 \le a, \ldots, X_n \le a).$$

Now suppose that the events $\{X_i \le a_i\}$ are independent for every choice
of the $a_i$. In this case we call the random variables independent (see also
Chapter 9, where we study independence of random variables). In particular,
the events $\{X_i \le a\}$ are independent for all $a$. It then follows that

$$F_Z(a) = \mathrm{P}(X_1 \le a, \ldots, X_n \le a) = \mathrm{P}(X_1 \le a) \cdots \mathrm{P}(X_n \le a).$$

Hence,  if  all  random  variables  have  the  same  distribution  function  F,  then 
the  following  result  holds. 


The distribution of the maximum. Let $X_1, X_2, \ldots, X_n$ be $n$
independent random variables with the same distribution function
$F$, and let $Z = \max\{X_1, X_2, \ldots, X_n\}$. Then

$$F_Z(a) = (F(a))^n.$$
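
A simulation can make this result plausible. The sketch below is an illustration under assumptions of our own choosing: the $X_i$ are taken to be Exp(1) distributed, $n = 5$, $a = 2$, and the function name `empirical_cdf_of_max` is hypothetical. It compares the fraction of simulated maxima not exceeding $a$ with $(F(a))^n$.

```python
import math
import random

def empirical_cdf_of_max(n, a, trials, rng):
    """Fraction of trials in which max{X_1, ..., X_n} <= a, with X_i ~ Exp(1)."""
    hits = 0
    for _ in range(trials):
        z = max(rng.expovariate(1.0) for _ in range(n))
        if z <= a:
            hits += 1
    return hits / trials

rng = random.Random(2024)    # fixed seed, so the run is reproducible
n, a = 5, 2.0
F_a = 1.0 - math.exp(-a)     # distribution function of Exp(1) evaluated at a
theoretical = F_a ** n       # (F(a))^n from the boxed result
empirical = empirical_cdf_of_max(n, a, trials=20000, rng=rng)

print(theoretical, empirical)
```

With 20 000 trials the two numbers should typically agree to about two decimal places; the remaining difference is simulation noise.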

Quick exercise 8.5 Let $X_1, X_2, \ldots, X_n$ be independent random variables,
all with a $U(0, 1)$ distribution. Let $Z = \max\{X_1, \ldots, X_n\}$. Compute the
distribution function and the probability density function of $Z$.

What can we say about the distribution of the minimum? Let

$$V = \min\{X_1, X_2, \ldots, X_n\}.$$

We can now find the distribution function $F_V$ of $V$ by observing that the
minimum of the $X_i$ is larger than a number $a$ if and only if all $X_i$ are larger
than $a$. The trick is to switch to the complement of the event $\{V \le a\}$:

$$F_V(a) = \mathrm{P}(V \le a) = 1 - \mathrm{P}(V > a) = 1 - \mathrm{P}(\min\{X_1, \ldots, X_n\} > a) = 1 - \mathrm{P}(X_1 > a, \ldots, X_n > a).$$

So using independence and switching back again, we obtain

$$F_V(a) = 1 - \mathrm{P}(X_1 > a, \ldots, X_n > a) = 1 - \mathrm{P}(X_1 > a) \cdots \mathrm{P}(X_n > a) = 1 - (1 - \mathrm{P}(X_1 \le a)) \cdots (1 - \mathrm{P}(X_n \le a)).$$

We  have  found  the  following  result  for  the  minimum. 


The distribution of the minimum. Let $X_1, X_2, \ldots, X_n$ be $n$
independent random variables with the same distribution function
$F$, and let $V = \min\{X_1, X_2, \ldots, X_n\}$. Then

$$F_V(a) = 1 - (1 - F(a))^n.$$
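
The formula for the minimum can be checked by simulation in the same spirit. In the sketch below the choices are assumptions for illustration only: standard normal $X_i$, $n = 4$, and $a = -0.5$. The distribution function $F$ is taken from Python's `statistics.NormalDist`.

```python
import random
from statistics import NormalDist

rng = random.Random(7)       # fixed seed, so the run is reproducible
n, a, trials = 4, -0.5, 20000

F_a = NormalDist().cdf(a)                   # F(a) for the standard normal distribution
theoretical_min = 1.0 - (1.0 - F_a) ** n    # 1 - (1 - F(a))^n from the boxed result

# Count how often the minimum of n independent draws is at most a.
hits = sum(
    min(rng.gauss(0.0, 1.0) for _ in range(n)) <= a
    for _ in range(trials)
)
empirical_min = hits / trials

print(theoretical_min, empirical_min)
```

Note that the probability that the minimum is at most $a$ is much larger than $F(a)$ itself: it is enough that one of the $n$ variables falls below $a$.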

Quick exercise 8.6 Let $X_1, X_2, \ldots, X_n$ be independent random variables,
all with a $U(0, 1)$ distribution. Let $V = \min\{X_1, \ldots, X_n\}$. Compute the
distribution function and the probability density function of $V$.

8.5 Solutions to the quick exercises


8.1 Clearly $Z$ can take the values $1, \ldots, 150$. The value 150 is special:
the plane is full if 150 or more people buy a ticket. Hence $\mathrm{P}(Z = 150) =
\mathrm{P}(X \ge 150) = 51/200$. For the other values we have $\mathrm{P}(Z = i) = \mathrm{P}(X = i) =
1/200$, for $i = 1, \ldots, 149$. Clearly, here $g(x) = \min\{150, x\}$.

8.2 The probability density of $Y = 1/X$ is

$$f_Y(y) = \frac{1}{y^2} \cdot \frac{1}{\pi\left(1 + (\tfrac{1}{y})^2\right)} = \frac{1}{\pi(1 + y^2)}.$$

We see that $1/X$ has the same distribution as $X$! (This distribution is called
the standard Cauchy distribution; it will be introduced in Chapter 11.)

8.3 First define $Z = (X - 4)/5$, which has an $N(0, 1)$ distribution. Then from
Table B.1

$$\mathrm{P}(X \le 5) = \mathrm{P}\left(\frac{X - 4}{5} \le \frac{5 - 4}{5}\right) = \mathrm{P}(Z \le 0.20) = 1 - 0.4207 = 0.5793.$$

Similarly, using the symmetry of the normal distribution,

$$\mathrm{P}(X \ge 2) = \mathrm{P}\left(\frac{X - 4}{5} \ge \frac{2 - 4}{5}\right) = \mathrm{P}(Z \ge -0.40) = \mathrm{P}(Z \le 0.40) = 0.6554.$$

8.4 If $g(x) = e^{-x}$, then $g''(x) = e^{-x} > 0$; hence $g$ is strictly convex. It follows
from Jensen’s inequality that

$$e^{-\mathrm{E}[X]} \le \mathrm{E}[e^{-X}].$$

Moreover, if $\mathrm{Var}(X) > 0$, then the inequality is strict.

8.5 The distribution function of the $X_i$ is given by $F(x) = x$ on $[0, 1]$. Therefore
the distribution function $F_Z$ of the maximum $Z$ is equal to $F_Z(a) =
(F(a))^n = a^n$. Its probability density function is

$$f_Z(z) = \frac{d}{dz} F_Z(z) = n z^{n-1} \quad \text{for } 0 \le z \le 1.$$

8.6 The distribution function of the $X_i$ is given by $F(x) = x$ on $[0, 1]$. Therefore
the distribution function $F_V$ of the minimum $V$ is equal to $F_V(a) =
1 - (1 - a)^n$. Its probability density function is

$$f_V(v) = \frac{d}{dv} F_V(v) = n(1 - v)^{n-1} \quad \text{for } 0 \le v \le 1.$$

8.6 Exercises

8.1 □ Often one is interested in the distribution of the deviation of a random
variable $X$ from its mean $\mu = \mathrm{E}[X]$. Let $X$ take the values 80, 90, 100, 110,
and 120, all with probability 0.2; then $\mathrm{E}[X] = \mu = 100$. Determine the
distribution of $Y = |X - \mu|$. That is, specify the values $Y$ can take and give the
corresponding probabilities.

8.2 ⊞ Suppose $X$ has a uniform distribution over the points $\{1, 2, 3, 4, 5, 6\}$
and that $g(x) = \sin(\tfrac{\pi}{2} x)$.

a. Determine the distribution of $Y = g(X) = \sin(\tfrac{\pi}{2} X)$, that is, specify the
values $Y$ can take and give the corresponding probabilities.

b. Let $Z = \cos(\tfrac{\pi}{2} X)$. Determine the distribution of $Z$.

c. Determine the distribution of $W = Y^2 + Z^2$. Warning: in this example
there is a very special dependency between $Y$ and $Z$, and in general it is
much harder to determine the distribution of a random variable that is a
function of two other random variables. This is the subject of Chapter 11.

8.3 □ The continuous random variable $U$ is uniformly distributed over $[0, 1]$.

a. Determine the distribution function of $V = 2U + 7$. What kind of
distribution does $V$ have?

b. Determine the distribution function of $V = rU + s$ for all real numbers
$r > 0$ and $s$. See Exercise 8.9 for what happens for negative $r$.

8.4  Transforming  exponential  distributions. 

a. Let $X$ have an $Exp(\tfrac12)$ distribution. Determine the distribution function
of $\tfrac12 X$. What kind of distribution does $\tfrac12 X$ have?

b. Let $X$ have an $Exp(\lambda)$ distribution. Determine the distribution function
of $\lambda X$. What kind of distribution does $\lambda X$ have?

8.5 □ Let $X$ be a continuous random variable with probability density function

$$f_X(x) = \begin{cases} \tfrac34\, x(2 - x) & \text{for } 0 \le x \le 2 \\ 0 & \text{elsewhere.} \end{cases}$$

a. Determine the distribution function $F_X$.

b. Let $Y = \sqrt{X}$. Determine the distribution function $F_Y$.

c. Determine the probability density of $Y$.

8.6 Let $X$ be a continuous random variable with probability density $f_X$ that
takes only positive values and let $Y = 1/X$.

a. Determine $F_Y(y)$ and show that

$$f_Y(y) = \frac{1}{y^2}\, f_X\!\left(\frac{1}{y}\right) \quad \text{for } y > 0.$$

b. Let $Z = 1/Y$. Using a, determine the probability density $f_Z$ of $Z$, in terms
of $f_X$.

8.7 Let $X$ have a $Par(\alpha)$ distribution. Determine the distribution function of
$\ln X$. What kind of a distribution does $\ln X$ have?

8.8 □ Let $X$ have an $Exp(1)$ distribution, and let $\alpha$ and $\lambda$ be positive numbers.
Determine the distribution function of the random variable


The distribution of the random variable $W$ is called the Weibull distribution
with parameters $\alpha$ and $\lambda$.

8.9  Let  X  be  a  continuous  random  variable.  Express  the  distribution  function 
and  probability  density  of  the  random  variable  Y  =  —X  in  terms  of  those  of  X. 

8.10 ⊞ Let $X$ be an $N(3, 4)$ distributed random variable. Use the rule for
normal random variables under change of units and Table B.1 to determine
the probabilities $\mathrm{P}(X \ge 3)$ and $\mathrm{P}(X \le 1)$.

8.11 ⊞ Let $X$ be a random variable, and let $g$ be a twice differentiable function
with $g''(x) \le 0$ for all $x$. Such a function is called a concave function. Show
that for concave functions always

$$g(\mathrm{E}[X]) \ge \mathrm{E}[g(X)].$$

8.12 ⊞ Let $X$ be a random variable with the following probability mass function:

  x            0      1      100    10 000
  P(X = x)    1/4    1/4    1/4     1/4

a. Determine the distribution of $Y = \sqrt{X}$.

b. Which is larger: $\mathrm{E}[\sqrt{X}]$ or $\sqrt{\mathrm{E}[X]}$?
Hint: use Exercise 8.11, or start by showing that the function $g(x) = -\sqrt{x}$
is convex.

c. Compute $\sqrt{\mathrm{E}[X]}$ and $\mathrm{E}[\sqrt{X}]$ to check your answer (and to see that it
makes a big difference!).


8.13 Let $W$ have a $U(\pi, 2\pi)$ distribution. What is larger: $\mathrm{E}[\sin(W)]$ or
$\sin(\mathrm{E}[W])$? Check your answer by computing these two numbers.

8.14 In this exercise we take a look at Jensen’s inequality for the function
$g(x) = x^3$ (which is neither convex nor concave on $(-\infty, \infty)$).

a. Can you find a (discrete) random variable $X$ with $\mathrm{Var}(X) > 0$ such that

$$\mathrm{E}[X^3] = (\mathrm{E}[X])^3?$$

b. Under what kind of conditions on a random variable $X$ will the inequality
$\mathrm{E}[X^3] \ge (\mathrm{E}[X])^3$ certainly hold?

8.15 Let $X_1, X_2, \ldots, X_n$ be independent random variables, all with a $U(0, 1)$
distribution. Let $Z = \max\{X_1, \ldots, X_n\}$ and $V = \min\{X_1, \ldots, X_n\}$.

a. Compute $\mathrm{E}[\max\{X_1, X_2\}]$ and $\mathrm{E}[\min\{X_1, X_2\}]$.

b. Compute $\mathrm{E}[Z]$ and $\mathrm{E}[V]$ for general $n$.

c. Can you argue directly (using the symmetry of the uniform distribution
(see Exercise 6.3) and not the result of the computation in b) that
$1 - \mathrm{E}[\max\{X_1, \ldots, X_n\}] = \mathrm{E}[\min\{X_1, \ldots, X_n\}]$?

8.16 In this exercise we derive a kind of Jensen inequality for the minimum.

a. Let $a$ and $b$ be real numbers. Show that

$$\min\{a, b\} = \tfrac12\,(a + b - |a - b|).$$

b. Let $X$ and $Y$ be independent random variables with the same distribution
and finite expectation. Deduce from a that

$$\mathrm{E}[\min\{X, Y\}] = \mathrm{E}[X] - \tfrac12\,\mathrm{E}[|X - Y|].$$

c. Show that

$$\mathrm{E}[\min\{X, Y\}] \le \min\{\mathrm{E}[X], \mathrm{E}[Y]\}.$$

Remark: this is not so interesting, since $\min\{\mathrm{E}[X], \mathrm{E}[Y]\} = \mathrm{E}[X] = \mathrm{E}[Y]$,
but we will see in the exercises of Chapter 11 that this inequality is also true
for $X$ and $Y$ that do not have the same distribution.

8.17 Let $X_1, \ldots, X_n$ be $n$ independent random variables with the same
distribution function $F$.

a. Convince yourself that for any numbers $x_1, \ldots, x_n$ it is true that

$$\min\{x_1, \ldots, x_n\} = -\max\{-x_1, \ldots, -x_n\}.$$

b. Let $Z = \max\{X_1, X_2, \ldots, X_n\}$ and $V = \min\{X_1, X_2, \ldots, X_n\}$. Use
Exercise 8.9 and the observation in a to deduce the formula

$$F_V(a) = 1 - (1 - F(a))^n$$

directly from the formula

$$F_Z(a) = (F(a))^n.$$

8.18 □ Let $X_1, X_2, \ldots, X_n$ be independent random variables, all with an
$Exp(\lambda)$ distribution. Let $V = \min\{X_1, \ldots, X_n\}$. Determine the distribution
function of $V$. What kind of distribution is this?

8.19 ⊞ From the “north pole” $N$ of a circle with diameter 1, a point $Q$ on
the circle is mapped to a point $t$ on the line by its projection from $N$, as
illustrated in Figure 8.2.


Suppose that the point $Q$ is uniformly chosen on the circle. This is the same
as saying that the angle $\varphi$ is uniformly chosen from the interval $[-\tfrac{\pi}{2}, \tfrac{\pi}{2}]$ (can
you see this?). Let $X$ be this angle, so that $X$ is uniformly distributed over
the interval $[-\tfrac{\pi}{2}, \tfrac{\pi}{2}]$. This means that $\mathrm{P}(X \le \varphi) = 1/2 + \varphi/\pi$ (cf. Quick
exercise 5.3). What will be the distribution of the projection of $Q$ on the line?
Let us call this random variable $Z$. Then it is clear that the event $\{Z \le t\}$ is
equal to the event $\{X \le \varphi\}$, where $t$ and $\varphi$ correspond to each other under
the projection. This means that $\tan(\varphi) = t$, which is the same as saying that
$\arctan(t) = \varphi$.

a. What part of the circle is mapped to the interval $[1, \infty)$?

b. Compute the distribution function of $Z$ using the correspondence between
$t$ and $\varphi$.

c. Compute the probability density function of $Z$.

The distribution of $Z$ is called the Cauchy distribution (which will be discussed
in Chapter 11).

Joint distributions and independence


Random  variables  related  to  the  same  experiment  often  influence  one  another. 
In  order  to  capture  this,  we  introduce  the  joint  distribution  of  two  or  more 
random  variables.  We  also  discuss  the  notion  of  independence  for  random 
variables,  which  models  the  situation  where  random  variables  do  not  influence 
each  other.  As  with  single  random  variables  we  treat  these  topics  for  discrete 
and  continuous  random  variables  separately. 


9.1  Joint  distributions  of  discrete  random  variables 

In  a  census  one  is  usually  interested  in  several  variables,  such  as  income,  age, 
and  gender.  In  itself  these  variables  are  interesting,  but  when  two  (or  more)  are 
studied  simultaneously,  detailed  information  is  obtained  on  the  society  where 
the  census  is  performed.  For  instance,  studying  income,  age,  and  gender  jointly 
might  give  insight  to  the  emancipation  of  women. 

Without  mentioning  it  explicitly,  we  already  encountered  several  examples  of 
joint  distributions  of  discrete  random  variables.  For  example,  in  Chapter  4  we 
defined  two  random  variables  S  and  M,  the  sum  and  the  maximum  of  two 
independent  throws  of  a  die. 

Quick exercise 9.1 List the elements of the event $\{S = 7, M = 4\}$ and
compute its probability.

In general, the joint distribution of two discrete random variables $X$ and $Y$,
defined on the same sample space $\Omega$, is given by prescribing the probabilities
of all possible values of the pair $(X, Y)$.

Definition. The joint probability mass function $p$ of two discrete
random variables $X$ and $Y$ is the function $p : \mathbb{R}^2 \to [0, 1]$, defined by

$$p(a, b) = \mathrm{P}(X = a, Y = b) \quad \text{for } -\infty < a, b < \infty.$$


To stress the dependence on $(X, Y)$, we sometimes write $p_{X,Y}$ instead of $p$.

If $X$ and $Y$ take on the values $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$, respectively,
the joint distribution of $X$ and $Y$ can simply be described by listing all the
possible values of $p(a_i, b_j)$. For example, for the random variables $S$ and $M$
from Chapter 4 we obtain Table 9.1.


Table 9.1. Joint probability mass function p(a, b) = P(S = a, M = b).

                            b
   a       1      2      3      4      5      6
   2     1/36     0      0      0      0      0
   3       0    2/36     0      0      0      0
   4       0    1/36   2/36     0      0      0
   5       0      0    2/36   2/36     0      0
   6       0      0    1/36   2/36   2/36     0
   7       0      0      0    2/36   2/36   2/36
   8       0      0      0    1/36   2/36   2/36
   9       0      0      0      0    2/36   2/36
  10       0      0      0      0    1/36   2/36
  11       0      0      0      0      0    2/36
  12       0      0      0      0      0    1/36

From this table we can retrieve the distribution of $S$ and of $M$. For example,
because

$$\{S = 6\} = \{S = 6, M = 1\} \cup \{S = 6, M = 2\} \cup \cdots \cup \{S = 6, M = 6\},$$

and because the six events

$$\{S = 6, M = 1\},\ \{S = 6, M = 2\},\ \ldots,\ \{S = 6, M = 6\}$$

are mutually exclusive, we find that

$$\begin{aligned}
p_S(6) = \mathrm{P}(S = 6) &= \mathrm{P}(S = 6, M = 1) + \cdots + \mathrm{P}(S = 6, M = 6) \\
&= p(6, 1) + p(6, 2) + \cdots + p(6, 6) \\
&= 0 + 0 + \tfrac{1}{36} + \tfrac{2}{36} + \tfrac{2}{36} + 0 = \tfrac{5}{36}.
\end{aligned}$$

Table 9.2. Joint distribution and marginal distributions of S and M.

                            b
   a       1      2      3      4      5      6     p_S(a)
   2     1/36     0      0      0      0      0      1/36
   3       0    2/36     0      0      0      0      2/36
   4       0    1/36   2/36     0      0      0      3/36
   5       0      0    2/36   2/36     0      0      4/36
   6       0      0    1/36   2/36   2/36     0      5/36
   7       0      0      0    2/36   2/36   2/36     6/36
   8       0      0      0    1/36   2/36   2/36     5/36
   9       0      0      0      0    2/36   2/36     4/36
  10       0      0      0      0    1/36   2/36     3/36
  11       0      0      0      0      0    2/36     2/36
  12       0      0      0      0      0    1/36     1/36
 p_M(b)  1/36   3/36   5/36   7/36   9/36  11/36       1

Thus we see that the probabilities of $S$ can be obtained by taking the sum
of the joint probabilities in the rows of Table 9.1. This yields the probability
distribution of $S$, i.e., all values of $p_S(a)$ for $a = 2, \ldots, 12$. We speak of the
marginal distribution of $S$. In Table 9.2 we have added this distribution in the
right “margin” of the table. Similarly, summing over the columns of Table 9.1
yields the marginal distribution of $M$, in the bottom margin of Table 9.2.

The joint distribution of two random variables contains a lot more information
than the two marginal distributions. This can be illustrated by the fact that in
many cases the joint probability mass function of $X$ and $Y$ cannot be retrieved
from the marginal probability mass functions $p_X$ and $p_Y$. A simple example
is given in the following quick exercise.
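
The row and column sums can also be carried out mechanically. The following sketch rebuilds the joint probability mass function of $S$ and $M$ by enumerating the 36 equally likely outcomes of two throws of a die and then sums over rows and columns. The variable names `joint`, `p_S`, and `p_M` are chosen for this illustration only.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint pmf of S (sum) and M (maximum) of two independent throws of a die.
joint = defaultdict(Fraction)
for d1, d2 in product(range(1, 7), repeat=2):
    joint[(d1 + d2, max(d1, d2))] += Fraction(1, 36)

# Marginal pmfs: sum the joint probabilities over the other variable.
p_S = defaultdict(Fraction)
p_M = defaultdict(Fraction)
for (a, b), prob in joint.items():
    p_S[a] += prob   # row sums give the distribution of S
    p_M[b] += prob   # column sums give the distribution of M

print(p_S[6], p_M[4])
```

The printed values are the entries for $a = 6$ in the right margin and for $b = 4$ in the bottom margin of the table of marginal distributions.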

Quick exercise 9.2 Let $X$ and $Y$ be two discrete random variables, with
joint probability mass function $p$, given by the following table, where $\varepsilon$ is an
arbitrary number between $-1/4$ and $1/4$.

                    b
   a          0          1        p_X(a)
   0       1/4 - ε    1/4 + ε      ...
   1       1/4 + ε    1/4 - ε      ...
 p_Y(b)      ...        ...        ...

Complete the table, and conclude that we cannot retrieve $p$ from $p_X$ and $p_Y$.

The joint distribution function

As  in  the  case  of  a  single  random  variable,  the  distribution  function  enables 
us  to  treat  pairs  of  discrete  and  pairs  of  continuous  random  variables  in  the 
same  way. 


Definition. The joint distribution function $F$ of two random variables
$X$ and $Y$ is the function $F : \mathbb{R}^2 \to [0, 1]$ defined by

$$F(a, b) = \mathrm{P}(X \le a, Y \le b) \quad \text{for } -\infty < a, b < \infty.$$

Quick exercise 9.3 Compute $F(5, 3)$ for the joint distribution function $F$
of the pair $(S, M)$.

The distribution functions $F_X$ and $F_Y$ can be obtained from the joint
distribution function of $X$ and $Y$. As before, we speak of the marginal distribution
functions. The following rule holds.

From joint to marginal distribution function. Let $F$ be
the joint distribution function of random variables $X$ and $Y$. Then
the marginal distribution function of $X$ is given for each $a$ by

$$F_X(a) = \mathrm{P}(X \le a) = F(a, +\infty) = \lim_{b \to \infty} F(a, b), \qquad (9.1)$$

and the marginal distribution function of $Y$ is given for each $b$ by

$$F_Y(b) = \mathrm{P}(Y \le b) = F(+\infty, b) = \lim_{a \to \infty} F(a, b). \qquad (9.2)$$

9.2  Joint  distributions  of  continuous  random  variables 

We saw in Chapter 5 that the probability that a single continuous random
variable $X$ lies in an interval $[a, b]$ is equal to the area under the probability
density function $f$ of $X$ over the interval (see also Figure 5.1). For the joint
distribution of continuous random variables $X$ and $Y$ the situation is analogous:
the probability that the pair $(X, Y)$ falls in the rectangle $[a_1, b_1] \times [a_2, b_2]$
is equal to the volume under the joint probability density function $f(x, y)$ of
$(X, Y)$ over the rectangle. This is illustrated in Figure 9.1, where a chunk of
a joint probability density function $f(x, y)$ is displayed for $x$ between $-0.5$
and 1 and for $y$ between $-1.5$ and 1. Its volume represents the probability
$\mathrm{P}(-0.5 \le X \le 1, -1.5 \le Y \le 1)$. As the volume under $f$ on $[-0.5, 1] \times [-1.5, 1]$
is equal to the integral of $f$ over this rectangle, this motivates the following
definition.

Fig. 9.1. Volume under a joint probability density function $f$ on the rectangle
$[-0.5, 1] \times [-1.5, 1]$.

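Definition. Random variables $X$ and $Y$ have a joint continuous
distribution if for some function $f : \mathbb{R}^2 \to \mathbb{R}$ and for all numbers
$a_1, a_2$ and $b_1, b_2$ with $a_1 \le b_1$ and $a_2 \le b_2$,

$$\mathrm{P}(a_1 \le X \le b_1, a_2 \le Y \le b_2) = \int_{a_1}^{b_1} \int_{a_2}^{b_2} f(x, y)\, dy\, dx.$$

The function $f$ has to satisfy $f(x, y) \ge 0$ for all $x$ and $y$, and
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1$. We call $f$ the joint probability density
function of $X$ and $Y$.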

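As in the one-dimensional case there is a simple relation between the joint
distribution function $F$ and the joint probability density function $f$:

$$F(a, b) = \int_{-\infty}^{a} \int_{-\infty}^{b} f(x, y)\, dy\, dx \quad \text{and} \quad f(x, y) = \frac{\partial^2}{\partial x\, \partial y} F(x, y).$$
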
A joint probability density function of two random variables is also called
a bivariate probability density. An explicit example of such a density is the
function

$$f(x, y) = \frac{30}{\pi}\, e^{-50x^2 - 50y^2 + 80xy}$$

for $-\infty < x < \infty$ and $-\infty < y < \infty$; see Figure 9.2. This is an example of
a bivariate normal density (see Remark 11.2 for a full description of bivariate
normal distributions).

Fig. 9.2. A bivariate normal probability density function.

We illustrate a number of properties of joint continuous distributions by means
of the following simple example. Suppose that $X$ and $Y$ have joint probability
density function

$$f(x, y) = \frac{2}{75}\,(2x^2 y + x y^2) \quad \text{for } 0 \le x \le 3 \text{ and } 1 \le y \le 2,$$

and $f(x, y) = 0$ otherwise; see Figure 9.3.

Fig. 9.3. The probability density function $f(x, y) = \frac{2}{75}(2x^2 y + x y^2)$.

As an illustration of how to compute joint probabilities:

$$\begin{aligned}
\mathrm{P}\left(1 \le X \le 2, \tfrac43 \le Y \le \tfrac53\right) &= \int_1^2 \int_{4/3}^{5/3} f(x, y)\, dy\, dx \\
&= \frac{2}{75} \int_1^2 \left( \int_{4/3}^{5/3} (2x^2 y + x y^2)\, dy \right) dx \\
&= \frac{2}{75} \int_1^2 \left( x^2 + \frac{61}{81}\, x \right) dx = \frac{187}{2025}.
\end{aligned}$$

Next, for $a$ between 0 and 3 and $b$ between 1 and 2, we determine the
expression of the joint distribution function. Since $f(x, y) = 0$ for $x < 0$ or
$y < 1$,

$$\begin{aligned}
F(a, b) = \mathrm{P}(X \le a, Y \le b) &= \int_{-\infty}^{a} \left( \int_{-\infty}^{b} f(x, y)\, dy \right) dx \\
&= \frac{2}{75} \int_0^a \left( \int_1^b (2x^2 y + x y^2)\, dy \right) dx \\
&= \frac{1}{225}\,(2a^3 b^2 - 2a^3 + a^2 b^3 - a^2).
\end{aligned}$$

Note that for either $a$ outside $[0, 3]$ or $b$ outside $[1, 2]$, the expression for $F(a, b)$
is different. For example, suppose that $a$ is between 0 and 3 and $b$ is larger
than 2. Since $f(x, y) = 0$ for $y > 2$, we find for any $b \ge 2$:

$$F(a, b) = \mathrm{P}(X \le a, Y \le b) = \mathrm{P}(X \le a, Y \le 2) = F(a, 2) = \frac{1}{225}\,(6a^3 + 7a^2).$$

Hence, applying (9.1) one finds the marginal distribution function of $X$:

$$F_X(a) = \lim_{b \to \infty} F(a, b) = \frac{1}{225}\,(6a^3 + 7a^2)$$

for $a$ between 0 and 3.
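
These expressions lend themselves to an exact check with rational arithmetic. The sketch below encodes the joint distribution function on $0 \le a \le 3$, $1 \le b \le 2$ and recovers the probability of the rectangle $1 \le X \le 2$, $4/3 \le Y \le 5/3$ by inclusion-exclusion on $F$. The inclusion-exclusion step and the function names are our own additions for illustration; they are not part of the derivation above.

```python
from fractions import Fraction

def F(a, b):
    """Joint distribution function, valid for 0 <= a <= 3 and 1 <= b <= 2."""
    return Fraction(1, 225) * (2 * a**3 * b**2 - 2 * a**3 + a**2 * b**3 - a**2)

def F_X(a):
    """Marginal distribution function of X, valid for 0 <= a <= 3."""
    return Fraction(1, 225) * (6 * a**3 + 7 * a**2)

def rectangle_probability(a1, b1, a2, b2):
    """P(a1 < X <= b1, a2 < Y <= b2) by inclusion-exclusion on F."""
    return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

prob = rectangle_probability(1, 2, Fraction(4, 3), Fraction(5, 3))
print(prob)       # agrees with the value 187/2025 found by direct integration
print(F(3, 2))    # total probability mass
```

Since $X$ and $Y$ are continuous, it makes no difference whether the boundary of the rectangle is included.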

Quick exercise 9.4 Show that $F_Y(b) = \frac{1}{75}(3b^3 + 18b^2 - 21)$ for $b$ between 1
and 2.


The probability density of $X$ can be found by differentiating $F_X$:

$$f_X(x) = \frac{d}{dx} F_X(x) = \frac{2}{225}\,(9x^2 + 7x)$$

for $x$ between 0 and 3. It is also possible to obtain the probability density
function of $X$ directly from $f(x, y)$. Recall that we determined marginal
probabilities of discrete random variables by summing over the joint probabilities
(see Table 9.2). In a similar way we can find $f_X$. For $x$ between 0 and 3,

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy = \frac{2}{75} \int_1^2 (2x^2 y + x y^2)\, dy = \frac{2}{225}\,(9x^2 + 7x).$$

This  illustrates  the  following  rule. 


From joint to marginal probability density function. Let
$f$ be the joint probability density function of random variables $X$
and $Y$. Then the marginal probability densities of $X$ and $Y$ can be
found as follows:

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx.$$

Hence the probability density function of each of the random variables $X$ and
$Y$ can easily be obtained by “integrating out” the other variable.
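
“Integrating out” can also be done numerically. The sketch below approximates $f_X(x)$ for the example density $\frac{2}{75}(2x^2 y + x y^2)$ with a simple midpoint rule and compares it with the closed form $\frac{2}{225}(9x^2 + 7x)$. The midpoint rule, the number of steps, and the function names are assumptions of this illustration.

```python
def f(x, y):
    """Example joint density: (2/75)(2x^2 y + x y^2) on [0, 3] x [1, 2], zero elsewhere."""
    if 0 <= x <= 3 and 1 <= y <= 2:
        return 2 / 75 * (2 * x**2 * y + x * y**2)
    return 0.0

def marginal_density_x(x, steps=2000):
    """Approximate f_X(x) by integrating f(x, y) over y in [1, 2] with the midpoint rule."""
    width = 1.0 / steps
    return sum(f(x, 1.0 + (k + 0.5) * width) for k in range(steps)) * width

def closed_form(x):
    """Marginal density of X for 0 <= x <= 3, as derived in the text."""
    return 2 / 225 * (9 * x**2 + 7 * x)

for x in (0.5, 1.5, 2.5):
    print(x, marginal_density_x(x), closed_form(x))
```

The integration range is restricted to $[1, 2]$ because the density vanishes for other values of $y$.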

Quick exercise 9.5 Determine $f_Y(y)$.


9.3  More  than  two  random  variables 

To determine the joint distribution of $n$ random variables $X_1, X_2, \ldots, X_n$, all
defined on the same sample space $\Omega$, we have to describe how the probability
mass is distributed over all possible values of $(X_1, X_2, \ldots, X_n)$. In fact, it
suffices to specify the joint distribution function $F$ of $X_1, X_2, \ldots, X_n$, which
is defined by

$$F(a_1, a_2, \ldots, a_n) = \mathrm{P}(X_1 \le a_1, X_2 \le a_2, \ldots, X_n \le a_n)$$

for $-\infty < a_1, a_2, \ldots, a_n < \infty$.

In case the random variables $X_1, X_2, \ldots, X_n$ are discrete, the joint distribution
can also be characterized by specifying the joint probability mass function $p$
of $X_1, X_2, \ldots, X_n$, defined by

$$p(a_1, a_2, \ldots, a_n) = \mathrm{P}(X_1 = a_1, X_2 = a_2, \ldots, X_n = a_n)$$

for $-\infty < a_1, a_2, \ldots, a_n < \infty$.

Drawing  without  replacement 

Let us illustrate the use of the joint probability mass function with an example.
In the weekly Dutch National Lottery Show, 6 balls are drawn from a vase
that contains balls numbered from 1 to 41. Clearly, the first number takes
values $1, 2, \ldots, 41$ with equal probabilities. Is this also the case for, say, the
third ball?

Let us consider a more general situation. Suppose a vase contains balls numbered
$1, 2, \ldots, N$. We draw $n$ balls without replacement from the vase. Note
that $n$ cannot be larger than $N$. Each ball is selected with equal probability,
i.e., in the first draw each ball has probability $1/N$, in the second draw each of
the $N - 1$ remaining balls has probability $1/(N - 1)$, and so on. Let $X_i$ denote
the number on the ball in the $i$-th draw, for $i = 1, 2, \ldots, n$. In order to obtain
the marginal probability mass function of $X_i$, we first compute the joint
probability mass function of $X_1, X_2, \ldots, X_n$. Since there are $N(N-1) \cdots (N-n+1)$
possible combinations for the values of $X_1, X_2, \ldots, X_n$, each having the same
probability, the joint probability mass function is given by

$$p(a_1, a_2, \ldots, a_n) = \mathrm{P}(X_1 = a_1, X_2 = a_2, \ldots, X_n = a_n) = \frac{1}{N(N-1) \cdots (N-n+1)},$$

for all distinct values $a_1, a_2, \ldots, a_n$ with $1 \le a_j \le N$. Clearly $X_1, X_2, \ldots, X_n$
influence each other. Nevertheless, the marginal distribution of each $X_i$ is
the same. This can be seen as follows. Similar to obtaining the marginal
probability mass functions in Table 9.2, we can find the marginal probability
mass function of $X_i$ by summing the joint probability mass function over all
possible values of $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$:

$$p_{X_i}(k) = \sum p(a_1, \ldots, a_{i-1}, k, a_{i+1}, \ldots, a_n) = \sum \frac{1}{N(N-1) \cdots (N-n+1)},$$

where the sum runs over all distinct values $a_1, a_2, \ldots, a_n$ with $1 \le a_j \le N$
and $a_i = k$. Since there are $(N-1)(N-2) \cdots (N-n+1)$ such combinations,
we conclude that the marginal probability mass function of $X_i$ is given by

$$p_{X_i}(k) = \frac{(N-1)(N-2) \cdots (N-n+1)}{N(N-1) \cdots (N-n+1)} = \frac{1}{N}$$

for $k = 1, 2, \ldots, N$. We see that the marginal probability mass function of
each $X_i$ is the same, assigning equal probability $1/N$ to each possible value.
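
For a small vase the counting argument can be verified by brute force. The sketch below enumerates all ordered draws without replacement and tabulates the distribution of the number on each draw. The values $N = 6$ and $n = 3$ and the function name `marginal` are chosen only to keep the illustration small.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

N, n = 6, 3   # hypothetical small vase: 6 balls, 3 draws without replacement

# All ordered outcomes (a_1, ..., a_n) with distinct entries; each is equally likely.
outcomes = list(permutations(range(1, N + 1), n))
joint_prob = Fraction(1, len(outcomes))   # 1 / (N (N-1) ... (N-n+1))

def marginal(i):
    """Probability mass function of the number on draw i (i = 0, ..., n - 1)."""
    counts = Counter(outcome[i] for outcome in outcomes)
    return {k: c * joint_prob for k, c in counts.items()}

for i in range(n):
    print(i + 1, marginal(i))
```

Every draw, including the last one, assigns probability $1/N$ to each ball number, even though the draws clearly influence each other.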

In case the random variables $X_1, X_2, \ldots, X_n$ are continuous, the joint
distribution is defined in a similar way as in the case of two variables. We say
that the random variables $X_1, X_2, \ldots, X_n$ have a joint continuous distribution
if for some function $f : \mathbb{R}^n \to \mathbb{R}$ and for all numbers $a_1, a_2, \ldots, a_n$ and
$b_1, b_2, \ldots, b_n$ with $a_i \le b_i$,

$$\mathrm{P}(a_1 \le X_1 \le b_1, a_2 \le X_2 \le b_2, \ldots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} f(x_1, x_2, \ldots, x_n)\, dx_n \cdots dx_2\, dx_1.$$

Again $f$ has to satisfy $f(x_1, x_2, \ldots, x_n) \ge 0$ and $f$ has to integrate to 1. We
call $f$ the joint probability density of $X_1, X_2, \ldots, X_n$.

9.4 Independent random variables

In earlier chapters we have spoken of independence of random variables,
anticipating a formal definition. On page 46 we postulated that the events

$$\{R_1 = a_1\},\ \{R_2 = a_2\},\ \ldots,\ \{R_{10} = a_{10}\}$$

related to the Bernoulli random variables $R_1, \ldots, R_{10}$ are independent. How
should one define independence of random variables? Intuitively, random
variables $X$ and $Y$ are independent if every event involving only $X$ is independent
of every event involving only $Y$. Since for two discrete random variables
$X$ and $Y$, any event involving $X$ and $Y$ is the union of events of the type
$\{X = a, Y = b\}$, an adequate definition for independence would be

$$\mathrm{P}(X = a, Y = b) = \mathrm{P}(X = a)\,\mathrm{P}(Y = b), \qquad (9.3)$$

for all possible values $a$ and $b$. However, this definition is useless for continuous
random variables, since for them both sides of (9.3) are equal to 0 for all $a$
and $b$. Both the discrete and the continuous case are covered by
the following definition.

Definition. The random variables $X$ and $Y$, with joint distribution
function $F$, are independent if

$$\mathrm{P}(X \le a, Y \le b) = \mathrm{P}(X \le a)\,\mathrm{P}(Y \le b),$$

that is,

$$F(a, b) = F_X(a)\,F_Y(b) \qquad (9.4)$$

for all possible values $a$ and $b$. Random variables that are not
independent are called dependent.

Note that independence of $X$ and $Y$ guarantees that the joint probability of
$\{X \le a, Y \le b\}$ factorizes. More generally, the following is true: if $X$ and $Y$
are independent, then

$$\mathrm{P}(X \in A, Y \in B) = \mathrm{P}(X \in A)\,\mathrm{P}(Y \in B), \qquad (9.5)$$

for all suitable $A$ and $B$, such as intervals and points. As a special case we
can take $A = \{a\}$, $B = \{b\}$, which yields that for independent $X$ and $Y$ the
probability of $\{X = a, Y = b\}$ equals the product of the marginal probabilities.
In fact, for discrete random variables the definition of independence can be
reduced, after cumbersome computations, to equality (9.3). For continuous
random variables $X$ and $Y$ we find, differentiating both sides of (9.4) with
respect to $x$ and $y$, that

$$f(x, y) = f_X(x)\,f_Y(y).$$

Quick exercise 9.6 Determine for which value of $\varepsilon$ the discrete random
variables $X$ and $Y$ from Quick exercise 9.2 are independent.
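
For discrete random variables the product rule (9.3) can be tested directly on a joint probability mass function. The sketch below does this for two pairs built from two throws of a die: the two throws themselves, and the pair (sum, maximum) from Chapter 4. The helper names `joint_pmf` and `is_independent` are hypothetical and serve only this illustration.

```python
from fractions import Fraction
from itertools import product

def joint_pmf(pair_of):
    """Joint pmf of the pair pair_of(d1, d2), for two independent throws of a fair die."""
    joint = {}
    for d1, d2 in product(range(1, 7), repeat=2):
        key = pair_of(d1, d2)
        joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)
    return joint

def is_independent(joint):
    """Check P(X = a, Y = b) = P(X = a) P(Y = b) for all possible values a and b."""
    p_x, p_y = {}, {}
    for (a, b), prob in joint.items():
        p_x[a] = p_x.get(a, Fraction(0)) + prob
        p_y[b] = p_y.get(b, Fraction(0)) + prob
    return all(
        joint.get((a, b), Fraction(0)) == p_x[a] * p_y[b]
        for a in p_x
        for b in p_y
    )

dice = joint_pmf(lambda d1, d2: (d1, d2))
sum_and_max = joint_pmf(lambda d1, d2: (d1 + d2, max(d1, d2)))

print(is_independent(dice))          # the two throws
print(is_independent(sum_and_max))   # S and M
```

The pair $(S, M)$ fails the test already at $a = 2$, $b = 1$: the joint probability is $1/36$, whereas the product of the marginal probabilities is $1/36 \cdot 1/36$.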

More generally, random variables $X_1, X_2, \ldots, X_n$, with joint distribution
function F, are independent if for all values $a_1, \ldots, a_n$,

$$F(a_1, a_2, \ldots, a_n) = F_{X_1}(a_1)\,F_{X_2}(a_2) \cdots F_{X_n}(a_n).$$

As in the case of two discrete random variables, the discrete random variables
$X_1, X_2, \ldots, X_n$ are independent if

$$P(X_1 = a_1, \ldots, X_n = a_n) = P(X_1 = a_1) \cdots P(X_n = a_n),$$

for all possible values $a_1, \ldots, a_n$. Thus we see that the definition of
independence for discrete random variables is in agreement with our intuitive
interpretation given earlier in (9.3).

In case of independent continuous random variables $X_1, X_2, \ldots, X_n$ with joint
probability density function f, differentiating the joint distribution function
with respect to all the variables gives that

$$f(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1)\,f_{X_2}(x_2) \cdots f_{X_n}(x_n) \qquad (9.6)$$

for all values $x_1, \ldots, x_n$. By integrating both sides over
$(-\infty, a_1] \times (-\infty, a_2] \times \cdots \times (-\infty, a_n]$, we find the definition of independence. Hence in the continuous
case, (9.6) is equivalent to the definition of independence.
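For two discrete random variables, the product condition can be checked mechanically from a table of joint probabilities. The following is a minimal sketch, not part of the original text; the helper names `marginals` and `is_independent` and the two example distributions are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return the marginal pmfs of X and Y from a dict {(a, b): P(X=a, Y=b)}."""
    p_x, p_y = {}, {}
    for (a, b), p in joint.items():
        p_x[a] = p_x.get(a, 0) + p
        p_y[b] = p_y.get(b, 0) + p
    return p_x, p_y

def is_independent(joint):
    """Check P(X=a, Y=b) == P(X=a) P(Y=b) for every pair of possible values."""
    p_x, p_y = marginals(joint)
    return all(joint.get((a, b), 0) == p_x[a] * p_y[b]
               for a, b in product(p_x, p_y))

# Hypothetical example: two fair coin flips, independent by construction.
quarter = Fraction(1, 4)
fair_coins = {(a, b): quarter for a in (0, 1) for b in (0, 1)}

# Hypothetical example: Y is a copy of X, so the pair is dependent.
half = Fraction(1, 2)
copied_coin = {(0, 0): half, (1, 1): half}

print(is_independent(fair_coins))   # True
print(is_independent(copied_coin))  # False
```

Exact fractions are used so that the equality test is not disturbed by floating-point rounding.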


9.5  Propagation  of  independence 


A  natural  question  is  whether  transformed  independent  random  variables  are 
again  independent.  We  start  with  a  simple  example.  Let  X  and  Y  be  two 
independent  random  variables  with  joint  distribution  function  F.  Take  an 
interval  I  =  (a,  b]  and  define  random  variables  U  and  V  as  follows: 


$$U = \begin{cases} 1 & \text{if } X \in I \\ 0 & \text{if } X \notin I \end{cases}
\qquad \text{and} \qquad
V = \begin{cases} 1 & \text{if } Y \in I \\ 0 & \text{if } Y \notin I. \end{cases}$$

Are U and V independent? Yes, they are! By using (9.5) and the independence
of X and Y, we can write

$$\begin{aligned}
P(U = 0, V = 1) &= P(X \in I^c, Y \in I) \\
&= P(X \in I^c)\,P(Y \in I) \\
&= P(U = 0)\,P(V = 1).
\end{aligned}$$

By a similar reasoning one finds that for all values a and b,

$$P(U = a, V = b) = P(U = a)\,P(V = b).$$

This illustrates the fact that for independent random variables $X_1, X_2, \ldots, X_n$,
the random variables $Y_1, Y_2, \ldots, Y_n$, where each $Y_i$ is determined by $X_i$ only,
inherit the independence from the $X_i$. The general rule is given here.


Propagation of independence. Let $X_1, X_2, \ldots, X_n$ be independent
random variables. For each i, let $h_i : \mathbb{R} \to \mathbb{R}$ be a function and
define the random variable

$$Y_i = h_i(X_i).$$

Then $Y_1, Y_2, \ldots, Y_n$ are also independent.


Often one uses this rule with all functions the same: $h_i = h$. For instance, in
the preceding example,

$$h(x) = \begin{cases} 1 & \text{if } x \in I \\ 0 & \text{if } x \notin I. \end{cases}$$


The rule is also useful when we need different transformations for different
$X_i$. We already saw an example of this in Chapter 6. In the single-server
queue example in Section 6.4, the Exp(0.5) random variables $T_1, T_2, \ldots$ and
U(2, 5) random variables $S_1, S_2, \ldots$ are required to be independent. They are
generated according to the technique described in Section 6.2. With a sequence
$U_1, U_2, \ldots$ of independent U(0, 1) random variables we can accomplish
independence of the $T_i$ and $S_i$ as follows:

$$T_i = F^{\mathrm{inv}}(U_{2i-1}) \quad \text{and} \quad S_i = G^{\mathrm{inv}}(U_{2i}),$$

where F and G are the distribution functions of the Exp(0.5) distribution and
the U(2, 5) distribution. The propagation-of-independence rule now guarantees
that all random variables $T_1, S_1, T_2, S_2, \ldots$ are independent.
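The construction of the $T_i$ and $S_i$ can be sketched in code. This is an illustration under stated assumptions rather than the book's own program: the function names are hypothetical, the inverses are written out from $F(x) = 1 - e^{-0.5x}$ and $G(x) = (x - 2)/3$, and Python's `random` module stands in for the sequence of independent U(0, 1) numbers.

```python
import math
import random

def exp_inverse(u, rate=0.5):
    """Inverse of F(x) = 1 - exp(-rate * x), the Exp(rate) distribution function."""
    return -math.log(1.0 - u) / rate

def uniform_inverse(u, low=2.0, high=5.0):
    """Inverse of G(x) = (x - low) / (high - low), the U(low, high) distribution function."""
    return low + (high - low) * u

def simulate(n, seed=0):
    """Build T_1..T_n and S_1..S_n from one stream U_1, U_2, ... of U(0,1) numbers."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(2 * n)]
    # u[0], u[2], ... play the role of U_1, U_3, ... (odd indices in the text)
    t = [exp_inverse(u[2 * i]) for i in range(n)]
    # u[1], u[3], ... play the role of U_2, U_4, ... (even indices in the text)
    s = [uniform_inverse(u[2 * i + 1]) for i in range(n)]
    return t, s

t, s = simulate(10_000)
print(sum(t) / len(t))  # should be near 2, the mean of Exp(0.5)
print(sum(s) / len(s))  # should be near 3.5, the mean of U(2, 5)
```

Each $T_i$ and $S_i$ is a function of a different $U_j$, which is exactly the situation covered by the propagation rule.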


9.6  Solutions  to  the  quick  exercises 

9.1  The  only  possibilities  with  the  sum  equal  to  7  and  the  maximum  equal 
to  4  are  the  combinations  (3,4)  and  (4,3).  They  both  have  probability  1/36, 
so that P(S = 7, M = 4) = 2/36.

9.2 Since $p_X(0)$, $p_X(1)$, $p_Y(0)$, and $p_Y(1)$ are all equal to 1/2, knowing only
$p_X$ and $p_Y$ yields no information on ε whatsoever. You have to be a student
at Hogwarts to be able to get the values of p right!

9.3 Since S and M are discrete random variables, F(5, 3) is the sum of the
probabilities P(S = a, M = b) of all combinations (a, b) with $a \le 5$ and $b \le 3$.
From Table 9.2 we see that this sum is 8/36.
9.4 For a between 0 and 3 and for b between 1 and 2, we have seen that

$$F(a, b) = \frac{1}{225}\left(2a^3b^2 - 2a^3 + a^2b^3 - a^2\right).$$

Since f(x, y) = 0 for x > 3, we find for any a > 3 and b between 1 and 2:

$$F(a, b) = P(X \le a, Y \le b) = P(X \le 3, Y \le b) = F(3, b) = \frac{1}{75}\left(3b^3 + 18b^2 - 21\right).$$

As a result, applying (9.2) yields that
$F_Y(b) = \lim_{a \to \infty} F(a, b) = F(3, b) = \frac{1}{75}(3b^3 + 18b^2 - 21)$, for b between 1 and 2.

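9.5 For y between 1 and 2, we have seen that $F_Y(y) = \frac{1}{75}(3y^3 + 18y^2 - 21)$.
Differentiating with respect to y yields that

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{1}{25}\left(3y^2 + 12y\right)$$

for y between 1 and 2 (and $f_Y(y) = 0$ otherwise). The probability density
function of Y can also be obtained directly from f(x, y). For y between 1
and 2:

$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \frac{2}{75} \int_0^3 \left(2x^2y + xy^2\right) dx
= \frac{2}{75}\left[\tfrac{2}{3}x^3y + \tfrac{1}{2}x^2y^2\right]_{x=0}^{x=3} = \frac{1}{25}\left(3y^2 + 12y\right).$$

Since f(x, y) = 0 for values of y not between 1 and 2, we have that
$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = 0$ for these y's.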

9.6 The number ε is between −1/4 and 1/4. Now X and Y are independent
in case $p(i, j) = P(X = i, Y = j) = P(X = i)\,P(Y = j) = p_X(i)\,p_Y(j)$, for all
i, j = 0, 1. If i = j = 0, we should have

$$\tfrac{1}{4} - \varepsilon = p(0, 0) = p_X(0)\,p_Y(0) = \tfrac{1}{4}.$$

This implies that ε = 0. Furthermore, for all other combinations (i, j) one
can check that for ε = 0 also $p(i, j) = p_X(i)\,p_Y(j)$, so that X and Y are
independent. If ε ≠ 0, we have $p(0, 0) \ne p_X(0)\,p_Y(0)$, so that X and Y are
dependent.


9.7  Exercises 

9.1 The joint probabilities P(X = a, Y = b) of discrete random variables X
and Y are given in the following table (which is based on the magical square
in Albrecht Dürer's engraving Melencolia I in Figure 9.4). Determine the
marginal probability distributions of X and Y, i.e., determine the probabilities
P(X = a) and P(Y = b) for a, b = 1, 2, 3, 4.

Fig. 9.4. Albrecht Dürer's Melencolia I.

Albrecht Dürer (German, 1471-1528) Melencolia I, 1514. Engraving. Bequest
of William P. Chapman, Jr., Class of 1895. Courtesy of the Herbert F. Johnson
Museum of Art, Cornell University.


                  a
  b       1        2        3        4
  1    16/136    3/136    2/136   13/136
  2     5/136   10/136   11/136    8/136
  3     9/136    6/136    7/136   12/136
  4     4/136   15/136   14/136    1/136
9.2 ⊞ The joint probability distribution of two discrete random variables X
and Y is partly given in the following table.

                     a
  b            0      1      2     P(Y = b)
  −1          ...    ...    ...      1/2
   1          ...    1/2    ...      1/2
  P(X = a)    1/6    2/3    1/6       1

a.  Complete  the  table. 

b.  Are  X  and  Y  dependent  or  independent? 

9.3 Let X and Y be two random variables, with joint distribution the Melencolia
distribution, given by the table in Exercise 9.1. What is

a. P(X = Y)?

b. P(X + Y = 5)?

c. P(1 < X < 3, 1 < Y < 3)?

d. P((X, Y) ∈ {1, 4} × {1, 4})?

9.4  This  exercise  will  be  easy  for  those  familiar  with  Japanese  puzzles  called 
nonograms.  The  marginal  probability  distributions  of  the  discrete  random 
variables  X  and  Y  are  given  in  the  following  table: 


                      a
  b            1      2      3      4      5     P(Y = b)
  1                                                5/14
  2                                                4/14
  3                                                2/14
  4                                                2/14
  5                                                1/14
  P(X = a)    1/14   5/14   4/14   2/14   2/14      1

Moreover, for a and b from 1 to 5 the joint probability P(X = a, Y = b) is
either 0 or 1/14. Determine the joint probability distribution of X and Y.

9.5 □ Let η be an unknown real number, and let the joint probabilities
P(X = a, Y = b) of the discrete random variables X and Y be given by the
following table:

                    a
  b        −1           0           1
  4     η − 1/16     1/4 − η        0
  5       1/8         3/16         1/8
  6     η + 1/16      1/16       1/4 − η

a. Which values can η attain?

b. Is there a value of η for which X and Y are independent?

9.6 □ Let X and Y be two independent Ber(1/2) random variables. Define
random variables U and V by:

U = X + Y and V = |X − Y|.

a.  Determine  the  joint  and  marginal  probability  distributions  of  U  and  V. 

b.  Find  out  whether  U  and  V  are  dependent  or  independent. 

9.7 To investigate the relation between hair color and eye color, the hair color
and eye color of 5383 persons were recorded. The data are given in the following
table:

                          Hair color
  Eye color    Fair/red    Medium    Dark/black
  Light          1168        825         305
  Dark            573       1312        1200

Source: B. Everitt and G. Dunn. Applied multivariate data analysis. Second
edition. Hodder Arnold, 2001; Table 4.12. Reproduced by permission of Hodder
& Stoughton.


Eye  color  is  encoded  by  the  values  1  (Light)  and  2  (Dark),  and  hair  color  by 
1  (Fair/red),  2  (Medium),  and  3  (Dark/black).  By  dividing  the  numbers  in 
the  table  by  5383,  the  table  is  turned  into  a  joint  probability  distribution  for 
random  variables  X  (hair  color)  taking  values  1  to  3  and  Y  (eye  color)  taking 
values  1  and  2. 

a.  Determine  the  joint  and  marginal  probability  distributions  of  X  and  Y . 

b.  Find  out  whether  X  and  Y  are  dependent  or  independent. 

9.8 ⊞ Let X and Y be independent random variables with probability distributions
given by

$$P(X = 0) = P(X = 1) = \tfrac{1}{2} \quad \text{and} \quad P(Y = 0) = P(Y = 2) = \tfrac{1}{2}.$$

a. Compute the distribution of Z = X + Y.

b. Let $\tilde{Y}$ and $\tilde{Z}$ be independent random variables, where $\tilde{Y}$ has the same
distribution as Y, and $\tilde{Z}$ the same distribution as Z. Compute the distribution
of $\tilde{X} = \tilde{Z} - \tilde{Y}$.

9.9 ⊞ Suppose that the joint distribution function of X and Y is given by

$$F(x, y) = 1 - e^{-2x} - e^{-y} + e^{-(2x+y)} \quad \text{if } x > 0,\ y > 0,$$

and F(x, y) = 0 otherwise.

a.  Determine  the  marginal  distribution  functions  of  X  and  Y. 

b.  Determine  the  joint  probability  density  function  of  X  and  Y . 

c.  Determine  the  marginal  probability  density  functions  of  X  and  Y. 

d.  Find  out  whether  X  and  Y  are  independent. 

9.10 □ Let X and Y be two continuous random variables with joint probability
density function

$$f(x, y) = \frac{12}{5}\,xy(1 + y) \quad \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1,$$

and f(x, y) = 0 otherwise.

a. Find the probability $P(\tfrac14 \le X \le \tfrac12,\ \tfrac13 \le Y \le \tfrac23)$.

b.  Determine  the  joint  distribution  function  of  X  and  Y  for  a  and  b  between 
0  and  1. 

c.  Use  your  answer  from  b  to  find  Fx{a)  for  a  between  0  and  1. 

d.  Apply  the  rule  on  page  122  to  find  the  probability  density  function  of  X 
from  the  joint  probability  density  function  f(x,  y).  Use  the  result  to  verify 
your  answer  from  c. 

e.  Find  out  whether  X  and  Y  are  independent. 

9.11 ⊞ Let X and Y be two continuous random variables, with the same
joint probability density function as in Exercise 9.10. Find the probability
P(X < Y) that X is smaller than Y.

9.12 The joint probability density function f of the pair (X, Y) is given by

$$f(x, y) = K\left(3x^2 + 8xy\right) \quad \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 2,$$

and f(x, y) = 0 for all other values of x and y. Here K is some positive
constant.

a. Find K.

b. Determine the probability P(2X ≤ Y).
9.13 □ On a disc with origin (0, 0) and radius 1, a point (X, Y) is selected by
throwing a dart that hits the disc in an arbitrary place. This is best described
by the joint probability density function f of X and Y, given by

$$f(x, y) = \begin{cases} c & \text{if } x^2 + y^2 \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

where c is some positive constant.


a. Determine c.

b. Let $R = \sqrt{X^2 + Y^2}$ be the distance from (X, Y) to the origin. Determine
the distribution function $F_R$.

c. Determine the marginal density function $f_X$. Without doing any calculations,
what can you say about $f_Y$?


9.14 An arbitrary point (X, Y) is drawn from the square [−1, 1] × [−1, 1].
This means that for any region G in the plane, the probability that (X, Y) is
in G, is given by the area of G ∩ □ divided by the area of □, where □ denotes
the square [−1, 1] × [−1, 1]:

$$P((X, Y) \in G) = \frac{\text{area of } G \cap \Box}{\text{area of } \Box}.$$

a. Determine the joint probability density function of the pair (X, Y).

b. Check that X and Y are two independent, U(−1, 1) distributed random
variables.


9.15 ⊞ Let the pair (X, Y) be drawn arbitrarily from the triangle Δ with
vertices (0, 0), (0, 1), and (1, 1).


a. Use Figure 9.5 to show that the joint distribution function F of the pair
(X, Y) satisfies

$$F(a, b) = \begin{cases}
0 & \text{for } a \text{ or } b \text{ less than } 0 \\
a(2b - a) & \text{for } (a, b) \text{ in the triangle } \Delta \\
b^2 & \text{for } b \text{ between 0 and 1 and } a \text{ larger than } b \\
2a - a^2 & \text{for } a \text{ between 0 and 1 and } b \text{ larger than 1} \\
1 & \text{for } a \text{ and } b \text{ larger than 1.}
\end{cases}$$

b. Determine the joint probability density function f of the pair (X, Y).

c. Show that $f_X(x) = 2 - 2x$ for x between 0 and 1 and that $f_Y(y) = 2y$ for
y between 0 and 1.

9.16 (Continuation of Exercise 9.15) An arbitrary point (U, V) is drawn from
the unit square [0, 1] × [0, 1]. Let X and Y be defined as in Exercise 9.15. Show
that min{U, V} has the same distribution as X and that max{U, V} has the
same distribution as Y.
[Figure 9.5: the triangle Δ, with the vertices (0,1) and (1,1) labeled.]


9.17 Let $U_1$ and $U_2$ be two independent random variables, both uniformly
distributed over [0, a]. Let $V = \min\{U_1, U_2\}$ and $Z = \max\{U_1, U_2\}$. Show
that the joint distribution function of V and Z is given by

$$F(s, t) = P(V \le s, Z \le t) = \frac{t^2 - (t - s)^2}{a^2} \quad \text{for } 0 \le s \le t \le a.$$

Hint: note that $V \le s$ and $Z \le t$ happens exactly when both $U_1 \le t$ and
$U_2 \le t$, but not both $s < U_1 \le t$ and $s < U_2 \le t$.


9.18 Suppose a vase contains balls numbered 1, 2, ..., N. We draw n balls
without replacement from the vase. Each ball is selected with equal probability,
i.e., in the first draw each ball has probability 1/N, in the second draw each
of the N − 1 remaining balls has probability 1/(N − 1), and so on. For i =
1, 2, ..., n, let $X_i$ denote the number on the ball in the ith draw. We have
shown that the marginal probability mass function of $X_i$ is given by

$$p_{X_i}(k) = \frac{1}{N}, \quad \text{for } k = 1, 2, \ldots, N.$$

a. Show that

$$E[X_i] = \frac{N + 1}{2}.$$

b. Compute the variance of $X_i$. You may use the identity

$$1 + 4 + 9 + \cdots + N^2 = \frac{1}{6}\,N(N + 1)(2N + 1).$$


9.19 □ Let X and Y be two continuous random variables, with joint probability
density function

$$f(x, y) = \frac{30}{\pi}\, e^{-50x^2 - 50y^2 + 80xy}$$

for $-\infty < x < \infty$ and $-\infty < y < \infty$; see also Figure 9.2.

a. Determine positive numbers a, b, and c such that

$$50x^2 - 80xy + 50y^2 = (ay - bx)^2 + cx^2.$$

b. Setting $\mu = \frac{4}{5}x$ and $\sigma = \frac{1}{10}$, show that

$$\left(\sqrt{50}\,y - \sqrt{32}\,x\right)^2 = \frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2$$

and use this to show that

$$\int_{-\infty}^{\infty} e^{-(\sqrt{50}\,y - \sqrt{32}\,x)^2}\,dy = \frac{\sqrt{2\pi}}{10}.$$

c. Use the results from b to determine the probability density function $f_X$
of X. What kind of distribution does X have?

9.20 Suppose we throw a needle on a large sheet of paper, on which horizontal
lines are drawn, which are at needle-length apart (see also Exercise 21.16).
Choose one of the horizontal lines as x-axis, and let (X, Y) be the center of the
needle. Furthermore, let Z be the distance of this center (X, Y) to the nearest
horizontal line under (X, Y), and let H be the angle between the needle and
the positive x-axis.

a. Assuming that the length of the needle is equal to 1, argue that Z has
a U(0, 1) distribution. Also argue that H has a U(0, π) distribution and
that Z and H are independent.

b. Show that the needle hits a horizontal line when

$$Z \le \tfrac{1}{2} \sin H \quad \text{or} \quad 1 - Z \le \tfrac{1}{2} \sin H.$$

c. Show that the probability that the needle will hit one of the horizontal
lines equals 2/π.
Covariance and correlation


In this chapter we see how the joint distribution of two or more random variables
is used to compute the expectation of a combination of these random
variables. We discuss the expectation and variance of a sum of random variables
and introduce the notions of covariance and correlation, which express
to some extent the way two random variables influence each other.


10.1  Expectation  and  joint  distributions 

China vases of various shapes are produced in the Delftware factories in the
old city of Delft. One particular simple cylindrical model has height H and
radius R centimeters. Due to all kinds of circumstances (the place of the vase
in the oven, the fact that the vases are handmade, etc.) H and R are not
constants but are random variables. The volume of a vase is equal to the
random variable $V = \pi H R^2$, and one is interested in its expected value E[V].
When $f_V$ denotes the probability density of V, then by definition

$$E[V] = \int_{-\infty}^{\infty} v\,f_V(v)\,dv.$$

However, to obtain E[V], we do not necessarily need to determine $f_V$ from
the joint probability density f of H and R! Since V is a function of H and R,
we can use a rule similar to the change-of-variable formula from Chapter 7:

$$E[V] = E\left[\pi H R^2\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \pi h r^2 f(h, r)\,dh\,dr.$$

Suppose that H has a U(25, 35) distribution and that R has a U(7.5, 12.5)
distribution. In the case that H and R are also independent, we have

$$\begin{aligned}
E[V] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \pi h r^2 f_H(h) f_R(r)\,dh\,dr
= \int_{7.5}^{12.5}\int_{25}^{35} \pi h r^2 \cdot \frac{1}{10} \cdot \frac{1}{5}\,dh\,dr \\
&= \frac{\pi}{50} \int_{25}^{35} h\,dh \int_{7.5}^{12.5} r^2\,dr = 9621.127 \text{ cm}^3.
\end{aligned}$$

This  illustrates  the  following  general  rule. 


Two-dimensional change-of-variable formula. Let X and
Y be random variables, and let $g : \mathbb{R}^2 \to \mathbb{R}$ be a function.

If X and Y are discrete random variables with values $a_1, a_2, \ldots$ and
$b_1, b_2, \ldots$, respectively, then

$$E[g(X, Y)] = \sum_i \sum_j g(a_i, b_j)\,P(X = a_i, Y = b_j).$$

If X and Y are continuous random variables with joint probability
density function f, then

$$E[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\,f(x, y)\,dx\,dy.$$


As an example, take g(x, y) = xy for discrete random variables X and Y with
the joint probability distribution given in Table 10.1. The expectation of XY
is computed as follows:

$$\begin{aligned}
E[XY] = {} & (0 \cdot 0) \cdot 0 + (1 \cdot 0) \cdot \tfrac14 + (2 \cdot 0) \cdot 0 \\
& + (0 \cdot 1) \cdot \tfrac14 + (1 \cdot 1) \cdot 0 + (2 \cdot 1) \cdot \tfrac14 \\
& + (0 \cdot 2) \cdot 0 + (1 \cdot 2) \cdot \tfrac14 + (2 \cdot 2) \cdot 0 = 1.
\end{aligned}$$

A natural question is whether this value can also be obtained from E[X] E[Y].
We return to this question later in this chapter. First we address the expectation
of the sum of two random variables.


Table 10.1. Joint probabilities P(X = a, Y = b).

               a
  b      0      1      2
  0      0     1/4     0
  1     1/4     0     1/4
  2      0     1/4     0
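The discrete form of the two-dimensional change-of-variable formula is a double sum, which is easy to evaluate from Table 10.1. The following minimal sketch is an illustration; the dictionary layout and the name `expectation` are assumptions, not notation from the text.

```python
from fractions import Fraction

q = Fraction(1, 4)
# Table 10.1 as {(a, b): P(X=a, Y=b)}; pairs not listed have probability 0.
table_10_1 = {(1, 0): q, (0, 1): q, (2, 1): q, (1, 2): q}

def expectation(g, joint):
    """E[g(X, Y)] as the double sum of g(a, b) * P(X=a, Y=b)."""
    return sum(g(a, b) * p for (a, b), p in joint.items())

print(expectation(lambda x, y: x * y, table_10_1))  # 1
```

Any other function g of two arguments can be passed in the same way.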
Quick exercise 10.1 Compute E[X + Y] for the random variables with the
joint distribution given in Table 10.1.

For discrete X and Y with values $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$, respectively, we
see that

$$\begin{aligned}
E[X + Y] &= \sum_i \sum_j (a_i + b_j)\,P(X = a_i, Y = b_j) \\
&= \sum_i \sum_j a_i\,P(X = a_i, Y = b_j) + \sum_i \sum_j b_j\,P(X = a_i, Y = b_j) \\
&= \sum_i a_i \Big( \sum_j P(X = a_i, Y = b_j) \Big) + \sum_j b_j \Big( \sum_i P(X = a_i, Y = b_j) \Big) \\
&= \sum_i a_i\,P(X = a_i) + \sum_j b_j\,P(Y = b_j) \\
&= E[X] + E[Y].
\end{aligned}$$

A  similar  line  of  reasoning  applies  in  case  X  and  Y  are  continuous  random 
variables.  The  following  general  rule  holds. 


Linearity of expectations. For all numbers r, s, and t and
random variables X and Y, one has

$$E[rX + sY + t] = r\,E[X] + s\,E[Y] + t.$$

Quick exercise 10.2 Determine the marginal distributions for the random
variables X and Y with the joint distribution given in Table 10.1, and use
them to compute E[X] and E[Y]. Check that E[X] + E[Y] is equal to E[X + Y],
which was computed in Quick exercise 10.1.

More generally, for random variables $X_1, \ldots, X_n$ and numbers $s_1, \ldots, s_n$ and t,

$$E[s_1 X_1 + \cdots + s_n X_n + t] = s_1 E[X_1] + \cdots + s_n E[X_n] + t.$$

This rule is a powerful instrument. For example, it provides an easy way to
compute the expectation of a random variable X with a Bin(n, p) distribution.
If we were to use the definition of expectation, we would have to compute

$$E[X] = \sum_{k=0}^{n} k\,P(X = k) = \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k}.$$

Determining this sum is not straightforward. However, there is a simple alternative.
Recall the multiple-choice example from Section 4.3. We represented
the number of correct answers out of 10 multiple-choice questions as a sum of
10 Bernoulli random variables. More generally, any random variable X with
a Bin(n, p) distribution can be represented as

$$X = R_1 + R_2 + \cdots + R_n,$$

where $R_1, R_2, \ldots, R_n$ are independent Ber(p) random variables, i.e.,

$$R_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p. \end{cases}$$

Since $E[R_i] = 0 \cdot (1 - p) + 1 \cdot p = p$, for every i = 1, 2, ..., n, the
linearity-of-expectations rule yields

$$E[X] = E[R_1] + E[R_2] + \cdots + E[R_n] = np.$$

Hence we conclude that the expectation of a Bin(n, p) distribution equals np.
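The two routes to the expectation can be compared numerically. This sketch evaluates the sum from the definition with exact fractions and compares it with np; the values n = 10 and p = 1/4 are hypothetical, picked only to have a concrete case.

```python
from fractions import Fraction
from math import comb

def binomial_mean_by_definition(n, p):
    """Sum k * C(n, k) * p^k * (1 - p)^(n - k) over k = 0, ..., n."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

n, p = 10, Fraction(1, 4)   # hypothetical values for illustration
print(binomial_mean_by_definition(n, p))  # 5/2
print(n * p)                              # 5/2
```

The agreement is exact here because `Fraction` arithmetic avoids rounding.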


Remark 10.1 (More than two random variables). In both the discrete
and continuous cases, the change-of-variable formula for n random variables
is a straightforward generalization of the change-of-variable formula for two
random variables. For instance, if $X_1, X_2, \ldots, X_n$ are continuous random
variables, with joint probability density function f, and g is a function from
$\mathbb{R}^n$ to $\mathbb{R}$, then

$$E[g(X_1, \ldots, X_n)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_n)\,f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n.$$


10.2  Covariance 


In the previous section we have seen that for two random variables X and Y
always

$$E[X + Y] = E[X] + E[Y].$$

Does such a simple relation also hold for the variance of the sum Var(X + Y)
or for the expectation of the product E[XY]? We will investigate this in the
current section.

For the variables X and Y from the example in Section 9.2 with joint probability
density

$$f(x, y) = \frac{2}{75}\left(2x^2y + xy^2\right) \quad \text{for } 0 \le x \le 3 \text{ and } 1 \le y \le 2,$$

one can show that

$$\operatorname{Var}(X + Y) = \frac{939}{2000} \quad \text{and} \quad \operatorname{Var}(X) + \operatorname{Var}(Y) = \frac{989}{2500} + \frac{791}{10\,000} = \frac{4747}{10\,000}$$

(see Exercise 10.10). This shows, in contrast to the linearity-of-expectations
rule, that Var(X + Y) is generally not equal to Var(X) + Var(Y). To determine
Var(X + Y), we exploit its definition:

$$\operatorname{Var}(X + Y) = E\left[(X + Y - E[X + Y])^2\right].$$

Now X + Y − E[X + Y] = (X − E[X]) + (Y − E[Y]), so that

$$(X + Y - E[X + Y])^2 = (X - E[X])^2 + (Y - E[Y])^2 + 2(X - E[X])(Y - E[Y]).$$

Taking expectations on both sides, another application of the
linearity-of-expectations rule gives

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2E[(X - E[X])(Y - E[Y])].$$

That is, the variance of the sum X + Y equals the sum of the variances of X
and Y, plus an extra term 2E[(X − E[X])(Y − E[Y])]. To some extent this
term expresses the way X and Y influence each other.


Definition. Let X and Y be two random variables. The covariance
between X and Y is defined by

$$\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])].$$


Loosely speaking, if the covariance of X and Y is positive, then if X has a
realization larger than E[X], it is likely that Y will have a realization larger
than E[Y], and the other way around. In this case we say that X and Y are
positively correlated. In case the covariance is negative, the opposite effect
occurs; X and Y are negatively correlated. In case Cov(X, Y) = 0 we say that X
and Y are uncorrelated. An easy consequence of the linearity-of-expectations
property (see Exercise 10.19) is the following rule.


An alternative expression for the covariance. Let X and
Y be two random variables, then

$$\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y].$$


For X and Y from the example in Section 9.2, we have E[X] = 109/50,
E[Y] = 157/100, and E[XY] = 171/50 (see Exercise 10.10). Thus we see that
X and Y are negatively correlated:

$$\operatorname{Cov}(X, Y) = \frac{171}{50} - \frac{109}{50} \cdot \frac{157}{100} = -\frac{13}{5000} < 0.$$

Moreover, this also illustrates that, in contrast to the expectation of the sum,
for the expectation of the product, in general E[XY] is not equal to E[X] E[Y].
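The moments quoted for the Section 9.2 example can be reproduced exactly, because the density is a polynomial on a rectangle. The sketch below is an illustration, not the derivation asked for in Exercise 10.10; it assumes the density $f(x, y) = \frac{2}{75}(2x^2y + xy^2)$ on $0 \le x \le 3$, $1 \le y \le 2$, and the helper name `moment` is hypothetical.

```python
from fractions import Fraction

def moment(m, n):
    """E[X^m Y^n] for f(x, y) = (2/75)(2 x^2 y + x y^2) on 0 <= x <= 3, 1 <= y <= 2."""
    def int_x(k):  # integral of x^k over [0, 3]
        return Fraction(3 ** (k + 1), k + 1)
    def int_y(k):  # integral of y^k over [1, 2]
        return Fraction(2 ** (k + 1) - 1, k + 1)
    return Fraction(2, 75) * (2 * int_x(m + 2) * int_y(n + 1)
                              + int_x(m + 1) * int_y(n + 2))

e_x, e_y, e_xy = moment(1, 0), moment(0, 1), moment(1, 1)
var_x = moment(2, 0) - e_x ** 2
var_y = moment(0, 2) - e_y ** 2
cov = e_xy - e_x * e_y               # alternative expression for the covariance
var_sum = var_x + var_y + 2 * cov    # variance of the sum X + Y

print(e_x, e_y, e_xy)   # 109/50 157/100 171/50
print(cov)              # -13/5000
print(var_sum)          # 939/2000
```

The term-by-term integration works only because each monomial $x^j y^k$ factorizes over the rectangle; it is not a general-purpose integrator.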
Independent versus uncorrelated

Now let X and Y be two independent random variables. One expects that X
and Y are uncorrelated: they have nothing to do with one another! This is
indeed the case, for instance, if X and Y are discrete; one finds that

$$\begin{aligned}
E[XY] &= \sum_i \sum_j a_i b_j\,P(X = a_i, Y = b_j) \\
&= \sum_i \sum_j a_i b_j\,P(X = a_i)\,P(Y = b_j) \\
&= \Big( \sum_i a_i\,P(X = a_i) \Big) \Big( \sum_j b_j\,P(Y = b_j) \Big) \\
&= E[X]\,E[Y].
\end{aligned}$$

A  similar  reasoning  holds  in  case  X  and  Y  are  continuous  random  variables. 
The  alternative  expression  for  the  covariance  leads  to  the  following  important 
observation. 


Independent  versus  uncorrelated.  If  two  random  variables 
X  and  Y  are  independent,  then  X  and  Y  are  uncorrelated. 


Note  that  the  reverse  is  not  necessarily  true.  If  X  and  Y  are  uncorrelated, 
they  need  not  be  independent.  This  is  illustrated  in  the  next  quick  exercise. 

Quick exercise 10.3 Consider the random variables X and Y with the joint
distribution given in Table 10.1. Check that X and Y are dependent, but that
also E[XY] = E[X] E[Y].

From  the  preceding  we  also  deduce  the  following  rule  on  the  variance  of  the 
sum  of  two  random  variables. 


Variance of the sum. Let X and Y be two random variables.
Then always

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y).$$

If X and Y are uncorrelated,

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y).$$


Hence, we always have that E[X + Y] = E[X] + E[Y], whereas Var(X + Y) =
Var(X) + Var(Y) only holds for uncorrelated random variables (and hence for
independent random variables!).

As with the linearity-of-expectations rule, the rule for the variance of the
sum of uncorrelated random variables holds more generally. For uncorrelated
random variables $X_1, X_2, \ldots, X_n$, we have

$$\operatorname{Var}(X_1 + X_2 + \cdots + X_n) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + \cdots + \operatorname{Var}(X_n).$$

This rule provides an easy way to compute the variance of a random variable
with a Bin(n, p) distribution. Recall the representation for a Bin(n, p) random
variable X:

$$X = R_1 + R_2 + \cdots + R_n.$$

Each $R_i$ has variance

$$\operatorname{Var}(R_i) = E[R_i^2] - (E[R_i])^2 = 0^2 \cdot (1 - p) + 1^2 \cdot p - (E[R_i])^2 = p - p^2 = p(1 - p).$$

Using the independence of the $R_i$, the rule for the variance of the sum yields

$$\operatorname{Var}(X) = \operatorname{Var}(R_1) + \operatorname{Var}(R_2) + \cdots + \operatorname{Var}(R_n) = np(1 - p).$$


10.3  The  correlation  coefficient 

In the previous section we saw that the covariance between random variables
gives an indication of how they influence one another. A disadvantage
of the covariance is the fact that it depends on the units in which the
random variables are represented. For instance, suppose that the length in
inches and weight in kilograms of Dutch citizens are modeled by random variables
L and W. Someone prefers to represent the length in centimeters. Since
1 inch = 2.54 cm, one is dealing with a transformed random variable 2.54L.
The covariance between 2.54L and W is

$$\operatorname{Cov}(2.54L, W) = E[(2.54L)W] - E[2.54L]\,E[W] = 2.54\left(E[LW] - E[L]\,E[W]\right) = 2.54\,\operatorname{Cov}(L, W).$$

That is, the covariance increases by a factor 2.54, which is somewhat disturbing
since changing from inches to centimeters does not essentially alter
the dependence between length and weight. This illustrates that the covariance
changes under a change of units. The following rule provides the exact
relationship.


Covariance under change of units. Let X and Y be two
random variables. Then

$$\operatorname{Cov}(rX + s, tY + u) = rt\,\operatorname{Cov}(X, Y)$$

for all numbers r, s, t, and u.


See Exercise 10.14 for a derivation of this rule.

Quick exercise 10.4 For X and Y in the example in Section 9.2 (see also
Section 10.2), show that Cov(−2X + 7, 5Y − 3) = 13/500.

The preceding discussion indicates that the covariance Cov(X, Y) may not
always be suitable to express the dependence between X and Y. For this
reason there is a standardized version of the covariance called the correlation
coefficient of X and Y.


Definition. Let X and Y be two random variables. The correlation
coefficient ρ(X, Y) is defined to be 0 if Var(X) = 0 or Var(Y) = 0,
and otherwise

$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.$$


Note that ρ(X, Y) remains unaffected by a change of units, and therefore it
is dimensionless. For instance, if X and Y are measured in kilometers, then
Cov(X, Y), Var(X) and Var(Y) are in km², so that the dimension of ρ(X, Y)
is in km²/(√km² · √km²).

For X and Y in the example in Section 9.2, recall that Cov(X, Y) = −13/5000.
We also have Var(X) = 989/2500 and Var(Y) = 791/10 000 (see Exercise 10.10),
so that

$$\rho(X, Y) = \frac{-\frac{13}{5000}}{\sqrt{\frac{989}{2500} \cdot \frac{791}{10\,000}}} = -0.0147.$$

Quick exercise 10.5 For X and Y in the example in Section 9.2, show that
ρ(−2X + 7, 5Y − 3) = 0.0147.

The previous quick exercise illustrates the following linearity property for the
correlation coefficient. For numbers r, s, t, and u fixed, r, t ≠ 0, and random
variables X and Y:

$$\rho(rX + s, tY + u) = \begin{cases} -\rho(X, Y) & \text{if } rt < 0, \\ \rho(X, Y) & \text{if } rt > 0. \end{cases}$$

Thus we see that the size of the correlation coefficient is unaffected by a change
of units, but note the possibility of a change of sign.
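The effect of a change of units on the correlation coefficient can be traced numerically from the values quoted for the Section 9.2 example. This is a sketch for illustration: the function name is hypothetical, and it relies on the covariance rule above together with the standard fact that Var(rX + s) = r² Var(X).

```python
from fractions import Fraction
from math import sqrt

# Values quoted in the text for the example of Section 9.2.
cov_xy = Fraction(-13, 5000)
var_x = Fraction(989, 2500)
var_y = Fraction(791, 10000)

def correlation_after_rescaling(r, t):
    """rho(rX + s, tY + u); the shifts s and u drop out, so only r and t matter."""
    cov = r * t * cov_xy     # covariance under change of units
    var_u = r * r * var_x    # Var(rX + s) = r^2 Var(X)
    var_v = t * t * var_y    # Var(tY + u) = t^2 Var(Y)
    return float(cov) / sqrt(var_u * var_v)

print(round(correlation_after_rescaling(1, 1), 4))    # -0.0147
print(round(correlation_after_rescaling(-2, 5), 4))   # 0.0147
```

The factor |rt| cancels between numerator and denominator, which leaves only the sign of rt.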

Two random variables X and Y are "most correlated" if X = Y or if X = −Y.
As a matter of fact, in the former case ρ(X, Y) = 1, while in the latter case
ρ(X, Y) = −1. In general, for nonconstant random variables X and Y, the
following property holds:

$$-1 \le \rho(X, Y) \le 1.$$

For a formal derivation of this property, see the next remark.


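Remark 10.2 (Correlations are between −1 and 1). Here we give a
proof of the preceding formula. Since the variance of any random variable
is nonnegative, we have that

$$\begin{aligned}
0 &\le \mathrm{Var}\!\left(\frac{X}{\sqrt{\mathrm{Var}(X)}} + \frac{Y}{\sqrt{\mathrm{Var}(Y)}}\right) \\
&= \mathrm{Var}\!\left(\frac{X}{\sqrt{\mathrm{Var}(X)}}\right) + \mathrm{Var}\!\left(\frac{Y}{\sqrt{\mathrm{Var}(Y)}}\right) + 2\,\mathrm{Cov}\!\left(\frac{X}{\sqrt{\mathrm{Var}(X)}}, \frac{Y}{\sqrt{\mathrm{Var}(Y)}}\right) \\
&= \frac{\mathrm{Var}(X)}{\mathrm{Var}(X)} + \frac{\mathrm{Var}(Y)}{\mathrm{Var}(Y)} + \frac{2\,\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \\
&= 2\bigl(1 + \rho(X, Y)\bigr).
\end{aligned}$$

This implies ρ(X, Y) ≥ −1. Using the same argument but replacing X by
−X shows that ρ(X, Y) ≤ 1.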


10.4  Solutions  to  the  quick  exercises 


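10.1 The expectation of X + Y is computed as follows:

$$\begin{aligned}
\mathrm{E}[X + Y] &= (0 + 0) \cdot 0 + (1 + 0) \cdot \tfrac14 + (2 + 0) \cdot 0 \\
&\quad + (0 + 1) \cdot \tfrac14 + (1 + 1) \cdot 0 + (2 + 1) \cdot \tfrac14 \\
&\quad + (0 + 2) \cdot 0 + (1 + 2) \cdot \tfrac14 + (2 + 2) \cdot 0 = 2.
\end{aligned}$$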

10.2 First complete Table 10.1 with the marginal distributions:

                      a
    b           0      1      2     P(Y = b)
    0           0     1/4     0       1/4
    1          1/4     0     1/4      1/2
    2           0     1/4     0       1/4
    P(X = a)   1/4    1/2    1/4       1

It follows that E[X] = 0 · 1/4 + 1 · 1/2 + 2 · 1/4 = 1, and similarly E[Y] = 1.
Therefore E[X] + E[Y] = 2, which is equal to E[X + Y] as computed in
Quick exercise 10.1.

10.3 From Table 10.1, as completed in Quick exercise 10.2, we see that X
and Y are dependent. For instance, P(X = 0, Y = 0) ≠ P(X = 0)P(Y = 0).
From Quick exercise 10.2 we know that E[X] = E[Y] = 1. Because we already
computed E[XY] = 1, it follows that E[XY] = E[X]E[Y]. According to the
alternative expression for the covariance this means that Cov(X, Y) = 0, i.e.,
X and Y are uncorrelated.
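The conclusion that X and Y are dependent yet uncorrelated can be verified mechanically. The sketch below encodes the joint probabilities of Table 10.1 as completed in Quick exercise 10.2; the dictionary layout and the helper `marginal` are our own choices, not part of the text.

```python
from fractions import Fraction

q = Fraction(1, 4)
# Nonzero entries of Table 10.1; keys are (a, b) = (value of X, value of Y).
pmf = {(1, 0): q, (0, 1): q, (2, 1): q, (1, 2): q}

def marginal(pmf, axis):
    """Marginal pmf of X (axis=0) or Y (axis=1)."""
    out = {}
    for key, p in pmf.items():
        out[key[axis]] = out.get(key[axis], 0) + p
    return out

px, py = marginal(pmf, 0), marginal(pmf, 1)
e_x = sum(a * p for a, p in px.items())
e_y = sum(b * p for b, p in py.items())
e_xy = sum(a * b * p for (a, b), p in pmf.items())
cov = e_xy - e_x * e_y  # alternative expression for the covariance

# Independence would require P(X=a, Y=b) = P(X=a)P(Y=b) for every pair (a, b).
independent = all(pmf.get((a, b), 0) == px[a] * py[b] for a in px for b in py)
print(cov, independent)
```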


10.4 We already computed Cov(X, Y) = −13/5000 in Section 10.2. Hence, by
the linearity-of-covariance rule Cov(−2X + 7, 5Y − 3) = (−2) · 5 · (−13/5000) =
13/500.


10.5 From Quick exercise 10.4 we have Cov(−2X + 7, 5Y − 3) = 13/500.
Since Var(X) = 989/2500 and Var(Y) = 791/10 000, by definition of the
correlation coefficient and the rule for variances,

$$\begin{aligned}
\rho(-2X + 7, 5Y - 3) &= \frac{\mathrm{Cov}(-2X + 7, 5Y - 3)}{\sqrt{\mathrm{Var}(-2X + 7) \cdot \mathrm{Var}(5Y - 3)}} \\
&= \frac{\frac{13}{500}}{\sqrt{4\,\mathrm{Var}(X) \cdot 25\,\mathrm{Var}(Y)}} = \frac{\frac{13}{500}}{\sqrt{\frac{3956}{2500} \cdot \frac{19\,775}{10\,000}}} = 0.0147.
\end{aligned}$$


10.5  Exercises 

10.1 □ Consider the joint probability distribution of X and Y from
Exercise 9.7, obtained from data on hair color and eye color, for which we already
computed the expectations and variances of X and Y, as well as E[XY].

a. Compute Cov(X, Y). Are X and Y positively correlated, negatively
correlated, or uncorrelated?

b. Compute the correlation coefficient between X and Y.

10.2 □ Consider the two discrete random variables X and Y with joint
distribution derived in Exercise 9.2:

                      a
    b           0      1      2     P(Y = b)
    −1         1/6    1/6    1/6      1/2
    1           0     1/2     0       1/2
    P(X = a)   1/6    2/3    1/6       1

a. Determine E[XY].

b. Note that X and Y are dependent. Show that X and Y are uncorrelated.

c. Determine Var(X + Y).

d. Determine Var(X − Y).

10.3 Let U and V be the two random variables from Exercise 9.6. We have
seen that U and V are dependent with joint probability distribution

                      a
    b           0      1      2     P(V = b)
    0          1/4     0     1/4      1/2
    1           0     1/2     0       1/2
    P(U = a)   1/4    1/2    1/4       1

Determine the covariance Cov(U, V) and the correlation coefficient ρ(U, V).

10.4 Consider the joint probability distribution of the discrete random
variables X and Y from the Melencolia Exercise 9.1. Compute Cov(X, Y).

                        a
    b        1         2         3         4
    1      16/136     3/136     2/136    13/136
    2       5/136    10/136    11/136     8/136
    3       9/136     6/136     7/136    12/136
    4       4/136    15/136    14/136     1/136

10.5 □ Suppose X and Y are discrete random variables taking values 0, 1,
and 2. The following is given about the joint and marginal distributions:

                      a
    b           0       1       2      P(Y = b)
    0          8/72    ...    10/72      1/3
    1         12/72   9/72     ...       1/2
    2          ...    3/72     ...       ...
    P(X = a)   1/3     ...     ...        1

a.  Complete  the  table. 

b.  Compute  the  expectation  of  X  and  of  Y  and  the  covariance  between  X 
and  Y. 

c. Are X and Y independent?

10.6 ⊞ Suppose X and Y are discrete random variables taking values c − 1, c,
and c + 1. The following is given about the joint and marginal distributions:

                        a
    b          c − 1     c     c + 1    P(Y = b)
    c − 1       2/45    9/45    4/45      1/3
    c           7/45    5/45    3/45      1/3
    c + 1       6/45    1/45    8/45      1/3
    P(X = a)    1/3     1/3     1/3        1

a.  Take  c  =  0  and  compute  the  expectation  of  X  and  of  Y  and  the  covariance 
between  X  and  Y. 

b. Show that X and Y are uncorrelated, no matter what the value of c is.
Hint: one could compute Cov(X, Y), but there is a short solution using
the rule on the covariance under change of units (see page 141) together
with part a.

c.  Are  X  and  Y  independent? 

10.7 □ Consider the joint distribution of Quick exercise 9.2 and take ε fixed
between −1/4 and 1/4:

                    b
    a          0           1         p_X(a)
    0       1/4 − ε     1/4 + ε       1/2
    1       1/4 + ε     1/4 − ε       1/2
    p_Y(b)    1/2         1/2          1

a. Take ε = 1/8 and compute Cov(X, Y).

b. Take ε = 1/8 and compute ρ(X, Y).

c. For which values of ε is ρ(X, Y) equal to −1, 0, or 1?

10.8 Let X and Y be random variables such that

E[X] = 2, E[Y] = 3, and Var(X) = 4.

a. Show that E[X²] = 8.

b. Determine the expectation of −2X² + Y.

10.9 ⊞ Suppose the blood of 1000 persons has to be tested to see which ones
are infected by a (rare) disease. Suppose that the probability that the test
is positive is p = 0.001. The obvious way to proceed is to test each person,
which results in a total of 1000 tests. An alternative procedure is the following.
Distribute the blood of the 1000 persons over 25 groups of size 40, and mix
half of the blood of each of the 40 persons with that of the others in each
group. Now test the aggregated blood sample of each group: when the test is
negative no one in that group has the disease; when the test is positive, at
least one person in the group has the disease, and one will test the other half
of the blood of all 40 persons of that group separately. In total, that gives 41
tests for that group. Let X_i be the total number of tests one has to perform
for the ith group using this alternative procedure.

a. Describe the probability distribution of X_i, i.e., list the possible values it
takes on and the corresponding probabilities.

b. What is the expected number of tests for the ith group? What is the
expected total number of tests? What do you think of this alternative
procedure for blood testing?

10.10 ⊞ Consider the variables X and Y from the example in Section 9.2
with joint probability density

$$f(x, y) = \frac{2}{75}\,(2x^2 y + x y^2) \quad \text{for } 0 \le x \le 3 \text{ and } 1 \le y \le 2$$

and marginal probability densities

$$f_X(x) = \frac{2}{225}\,(9x^2 + 7x) \quad \text{for } 0 \le x \le 3,$$
$$f_Y(y) = \frac{1}{25}\,(3y^2 + 12y) \quad \text{for } 1 \le y \le 2.$$

a. Compute E[X], E[Y], and E[X + Y].

b. Compute E[X²], E[Y²], E[XY], and E[(X + Y)²].

c. Compute Var(X + Y), Var(X), and Var(Y) and check that Var(X + Y) ≠
Var(X) + Var(Y).


10.11 Recall the relation between degrees Celsius and degrees Fahrenheit

$$\text{degrees Fahrenheit} = \frac{9}{5} \cdot \text{degrees Celsius} + 32.$$

Let X and Y be the average daily temperatures in degrees Celsius in
Amsterdam and Antwerp. Suppose that Cov(X, Y) = 3 and ρ(X, Y) = 0.8. Let T
and S be the same temperatures in degrees Fahrenheit. Compute Cov(T, S)
and ρ(T, S).


10.12 Consider the independent random variables H and R from the vase
example, with a U(25, 35) and a U(7.5, 12.5) distribution. Compute E[H]
and E[R²] and check that E[V] = πE[H]E[R²].




10.13 Let X and Y be as in the triangle example in Exercise 9.15. Recall from
Exercise 9.16 that X and Y represent the minimum and maximum coordinate
of a point that is drawn from the unit square: X = min{U, V} and Y =
max{U, V}.

a. Show that E[X] = 1/3, Var(X) = 1/18, E[Y] = 2/3, and Var(Y) = 1/18.
Hint: you might consult Exercise 8.15.

b. Check that Var(X + Y) = 1/6, by using that U and V are independent
and that X + Y = U + V.

c. Determine the covariance Cov(X, Y) using the results from a and b.

10.14 ⊞ Let X and Y be two random variables and let r, s, t, and u be
arbitrary real numbers.

a. Derive from the definition that Cov(X + s, Y + u) = Cov(X, Y).

b. Derive from the definition that Cov(rX, tY) = rt Cov(X, Y).

c. Combine parts a and b to show Cov(rX + s, tY + u) = rt Cov(X, Y).

10.15 In Figure 10.1 three plots are displayed. For each plot we carried out a
simulation in which we generated 500 realizations of a pair of random variables
(X, Y). We have chosen three different joint distributions of X and Y.

[Figure: three scatterplots; axis ticks at −2, 0, and 2.]

Fig. 10.1. Some scatterplots.


a. Indicate for each plot whether it corresponds to random variables X and
Y that are positively correlated, negatively correlated, or uncorrelated.

b. Which plot corresponds to random variables X and Y for which |ρ(X, Y)|
is maximal?

10.16 □ Let X and Y be random variables.

a. Express Cov(X, X + Y) in terms of Var(X) and Cov(X, Y).

b. Are X and X + Y positively correlated, uncorrelated, or negatively
correlated, or can anything happen?

c. Same question as in part b, but now assume that X and Y are
uncorrelated.

10.17 Extending the variance of the sum rule. For mathematical
convenience we first extend the sum rule to three random variables with zero
expectation. Next we further extend the rule to three random variables with
nonzero expectation. By the same line of reasoning we extend the rule to n
random variables.

a. Let X, Y and Z be random variables with expectation 0. Show that

$$\mathrm{Var}(X + Y + Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) + \mathrm{Var}(Z) + 2\,\mathrm{Cov}(X, Y) + 2\,\mathrm{Cov}(X, Z) + 2\,\mathrm{Cov}(Y, Z).$$

Hint: directly apply that for real numbers y_1, …, y_n

$$(y_1 + \cdots + y_n)^2 = y_1^2 + \cdots + y_n^2 + 2y_1 y_2 + 2y_1 y_3 + \cdots + 2y_{n-1} y_n.$$

b. Now show a for X, Y, and Z with nonzero expectation.

Hint: you might use the rules on pages 98 and 141 about variance and
covariance under a change of units.

c. Derive a general variance of the sum rule, i.e., show that if X_1, X_2, …, X_n
are random variables, then

$$\begin{aligned}
\mathrm{Var}(X_1 + X_2 + \cdots + X_n) &= \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n) \\
&\quad + 2\,\mathrm{Cov}(X_1, X_2) + 2\,\mathrm{Cov}(X_1, X_3) + \cdots + 2\,\mathrm{Cov}(X_1, X_n) \\
&\quad + 2\,\mathrm{Cov}(X_2, X_3) + \cdots + 2\,\mathrm{Cov}(X_2, X_n) \\
&\quad + \cdots + 2\,\mathrm{Cov}(X_{n-1}, X_n).
\end{aligned}$$

d. Show that if the variances are all equal to σ² and the covariances are all
equal to some constant γ, then

$$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = n\sigma^2 + n(n - 1)\gamma.$$

10.18 ⊞ Consider a vase containing balls numbered 1, 2, …, N. We draw
n balls without replacement from the vase. Each ball is selected with equal
probability, i.e., in the first draw each ball has probability 1/N, in the second
draw each of the N − 1 remaining balls has probability 1/(N − 1), and so
on. For i = 1, 2, …, n, let X_i denote the number on the ball in the ith draw.
From Exercise 9.18 we know that the variance of X_i equals

$$\mathrm{Var}(X_i) = \frac{1}{12}(N - 1)(N + 1).$$

Show that

$$\mathrm{Cov}(X_1, X_2) = -\frac{1}{12}(N + 1).$$

Before you do the exercise: why do you think the covariance is negative?
Hint: use Var(X_1 + X_2 + ··· + X_N) = 0 (why?), and apply Exercise 10.17.

10.19 Derive the alternative expression for the covariance: Cov(X, Y) =
E[XY] − E[X]E[Y].

Hint: work out (X − E[X])(Y − E[Y]) and use linearity of expectations.

10.20 Determine ρ(U, U²) when U has a U(0, a) distribution. Here a is a
positive number.


More computations with more random variables


Often  one  is  interested  in  combining  random  variables,  for  instance,  in  taking 
the  sum.  In  previous  chapters,  we  have  seen  that  it  is  fairly  easy  to  describe 
the  expected  value  and  the  variance  of  this  new  random  variable.  Often  more 
details  are  needed,  and  one  also  would  like  to  have  its  probability  distribu¬ 
tion.  In  this  chapter  we  consider  the  probability  distributions  of  the  sum,  the 
product,  and  the  quotient  of  two  random  variables. 


11.1  Sums  of  discrete  random  variables 

In  a  solo  race  across  the  Pacific  Ocean,  a  ship  has  one  spare  radio  set  for 
communications.  Each  of  the  two  radios  has  probability  p  of  failing  each  time 
it  is  switched  on.  The  skipper  uses  the  radio  once  every  day.  Let  X  be  the 
number  of  days  the  radio  is  switched  on  until  it  fails  (so  if  the  radio  can  be 
used  for  two  days  and  fails  on  the  third  day,  X  attains  the  value  3).  Similarly, 
let  Y  be  the  number  of  days  the  spare  radio  is  switched  on  until  it  fails.  Note 
that  these  random  variables  are  similar  to  the  one  discussed  in  Section  4.4, 
which  modeled  the  number  of  cycles  until  pregnancy.  Hence,  X  and  Y  are 
Geo(p) distributed random variables. Suppose that p = 1/75 and that the
trip will last 100 days. Then at first sight the skipper does not need to worry
about radio contact: the number of days the first radio lasts is X − 1 days,
and similarly the spare radio lasts Y − 1 days. Therefore the expected number
of days he is able to have radio contact is

$$\mathrm{E}[X - 1 + Y - 1] = \mathrm{E}[X] + \mathrm{E}[Y] - 2 = \frac{1}{p} + \frac{1}{p} - 2 = 148 \text{ days!}$$

The skipper, who has some training in probability theory, still has some
concerns about the risk he runs with these two radios. What if the probability
P(X + Y − 2 ≤ 99) that his two radios break down before the end of the trip
is large?

This example illustrates that it is important to study the probability
distribution of the sum Z = X + Y of two discrete random variables. The random
variable Z takes on values a_i + b_j, where a_i is a possible value of X and b_j
of Y. Hence, the probability mass function of Z is given by

$$p_Z(c) = \sum_{(i,j):\, a_i + b_j = c} \mathrm{P}(X = a_i, Y = b_j),$$

where the sum runs over all possible values a_i of X and b_j of Y such that
a_i + b_j = c. Because the sum only runs over values a_i that are equal to c − b_j,
we simplify the summation and write

$$p_Z(c) = \sum_j \mathrm{P}(X = c - b_j, Y = b_j),$$

where the sum runs over all possible values b_j of Y. When X and Y are
independent, then P(X = c − b_j, Y = b_j) = P(X = c − b_j)P(Y = b_j). This
leads to the following rule.


Adding two independent discrete random variables. Let X
and Y be two independent discrete random variables, with probability
mass functions p_X and p_Y. Then the probability mass function
p_Z of Z = X + Y satisfies

$$p_Z(c) = \sum_j p_X(c - b_j)\, p_Y(b_j),$$

where the sum runs over all possible values b_j of Y.
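The addition rule translates directly into a short computation. The sketch below represents a probability mass function as a Python dictionary from values to probabilities; this representation and the function name `add_independent` are our own choices. It is applied to the sum of two fair dice.

```python
from fractions import Fraction

def add_independent(px, py):
    """pmf of Z = X + Y for independent X and Y, each given as {value: probability}."""
    values = sorted({a + b for a in px for b in py})
    # p_Z(c) = sum over all values b of Y of p_X(c - b) * p_Y(b);
    # p_X(c - b) is 0 when c - b is not a possible value of X.
    return {c: sum(px.get(c - b, 0) * pb for b, pb in py.items()) for c in values}

die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = add_independent(die, die)

print(two_dice[7])             # probability that the two throws sum to 7
print(sum(two_dice.values()))  # the probabilities add up to 1
```

Exact fractions are used so that the result can be compared with a table of probabilities without rounding.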

Quick exercise 11.1 Let S be the sum of two independent throws with
a die, so S = X + Y, where X and Y are independent, and P(X = k) =
P(Y = k) = 1/6, for k = 1, …, 6. Use the addition rule to compute P(S = 3)
and P(S = 8), and compare your answers with Table 9.2.

In the solo race example, X and Y are independent Geo(p) distributed random
variables. Let Z = X + Y; then by the above rule for k ≥ 2

$$\mathrm{P}(X + Y = k) = p_Z(k) = \sum_{\ell=1}^{\infty} p_X(k - \ell)\, p_Y(\ell).$$

Because p_X(a) = 0 for a ≤ 0, all terms in this sum with ℓ ≥ k vanish, hence

$$\begin{aligned}
\mathrm{P}(X + Y = k) &= \sum_{\ell=1}^{k-1} p_X(k - \ell) \cdot p_Y(\ell) = \sum_{\ell=1}^{k-1} (1 - p)^{k-\ell-1} p \cdot (1 - p)^{\ell-1} p \\
&= \sum_{\ell=1}^{k-1} p^2 (1 - p)^{k-2} = (k - 1)\, p^2 (1 - p)^{k-2}.
\end{aligned}$$

Note that X + Y does not have a geometric distribution.

Remark 11.1 (The expected value of a geometric distribution).
The preceding gives us the opportunity to calculate the expected value of
the geometric distribution in an easy way. Since the probabilities of Z add
up to one:

$$1 = \sum_{k=2}^{\infty} p_Z(k) = \sum_{k=2}^{\infty} (k - 1)\, p^2 (1 - p)^{k-2} = p \sum_{\ell=1}^{\infty} \ell\, p (1 - p)^{\ell-1},$$

it follows that

$$\mathrm{E}[X] = \sum_{\ell=1}^{\infty} \ell\, p (1 - p)^{\ell-1} = \frac{1}{p}.$$

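Returning to the solo race example, it is clear that the skipper does have
grounds to worry:

$$\begin{aligned}
\mathrm{P}(X + Y - 2 \le 99) &= \mathrm{P}(X + Y \le 101) = \sum_{k=2}^{101} \mathrm{P}(X + Y = k) \\
&= \sum_{k=2}^{101} (k - 1) \left(\tfrac{1}{75}\right)^2 \left(1 - \tfrac{1}{75}\right)^{k-2} = 0.3904.
\end{aligned}$$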
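The sum above is easy to evaluate numerically. The following sketch uses the formula for P(X + Y = k) derived earlier with p = 1/75; the function name `pz` is our own.

```python
p = 1 / 75

def pz(k, p):
    """P(X + Y = k) for independent Geo(p) distributed X and Y, k >= 2."""
    return (k - 1) * p**2 * (1 - p) ** (k - 2)

# Probability that both radios have failed before the end of the 100-day trip.
prob_fail = sum(pz(k, p) for k in range(2, 102))
print(round(prob_fail, 4))
```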


The  sum  of  two  binomial  random  variables 

It is not always necessary to use the addition rule for two independent discrete
random variables to find the distribution of their sum. For example, let X and
Y be two independent random variables, where X has a Bin(n, p) distribution
and Y has a Bin(m, p) distribution. Since a Bin(n, p) distribution models
the number of successes in n independent trials with success probability p,
heuristically, X + Y represents the number of successes in n + m trials with
success probability p and should therefore have a Bin(n + m, p) distribution.
A more formal reasoning is the following. Let

R_1, R_2, …, R_n, S_1, S_2, …, S_m

be independent Ber(p) distributed random variables. Recall that a Bin(n, p)
distributed random variable has the same distribution as the sum of n
independent Ber(p) distributed random variables (see Section 4.3 or 10.2). Hence
X has the same distribution as R_1 + R_2 + ··· + R_n and Y has the same
distribution as S_1 + S_2 + ··· + S_m. This means that X + Y has the same
distribution as the sum of n + m independent Ber(p) variables and therefore has
a Bin(n + m, p) distribution. This can also be verified analytically by means
of the addition rule, using that X and Y are also independent.

Quick exercise 11.2 For i = 1, 2, 3, let X_i be a Bin(n_i, p) distributed
random variable, and suppose that X_1, X_2, and X_3 are independent. Argue that
Z = X_1 + X_2 + X_3 is a Bin(n_1 + n_2 + n_3, p) distributed random variable.


11.2 Sums of continuous random variables


Let X and Y be two continuous random variables. What can we say about the
probability density function of Z = X + Y? We start with an example. Suppose
that X and Y are two independent, U(0, 1) distributed random variables. One
might be tempted to think that Z is also uniformly distributed.

Note that the joint probability density function f of X and Y is equal to the
product of the marginal probability density functions f_X and f_Y:

$$f(x, y) = f_X(x)\, f_Y(y) = 1 \quad \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1,$$

and f(x, y) = 0 otherwise. Let us compute the distribution function F_Z of Z.
It is easy to see that F_Z(a) = 0 for a < 0 and F_Z(a) = 1 for a ≥ 2. For a
between 0 and 1, let G be that part of the plane below the line x + y = a, and
let Δ be the triangle with vertices (0, 0), (a, 0), and (0, a); see Figure 11.1.


Fig. 11.1. The region G in the plane where x + y ≤ a (with 0 < a < 1) intersected
with Δ.


Since f(x, y) = 0 outside [0, 1] × [0, 1], the distribution function of Z is given
by

$$F_Z(a) = \mathrm{P}(Z \le a) = \mathrm{P}(X + Y \le a) = \iint_G f(x, y)\, dx\, dy = \iint_\Delta 1\, dx\, dy = \text{area of } \Delta = \tfrac12 a^2$$

for 0 ≤ a < 1. For the case where 1 ≤ a < 2 one can draw a similar figure (see
Figure 11.2), from which one can find that

$$F_Z(a) = 1 - \tfrac12 (2 - a)^2 \quad \text{for } 1 \le a < 2.$$

Fig. 11.2. The region G in the plane where x + y ≤ a (with 1 < a < 2) intersected
with Δ.


We see that Z is not uniformly distributed.

In general, the distribution function F_Z of the sum Z of two continuous
random variables X and Y is given by

$$F_Z(a) = \mathrm{P}(Z \le a) = \mathrm{P}(X + Y \le a) = \iint_{(x,y):\, x + y \le a} f(x, y)\, dx\, dy.$$

The double integral on the right-hand side can be written as a repeated
integral, first over x and then over y. Note that x and y are between minus
and plus infinity and that they also have to satisfy x + y ≤ a or, equivalently,
x ≤ a − y. This means that the integral over x runs from minus infinity to
a − y, and the integral over y runs from minus infinity to plus infinity. Hence

$$F_Z(a) = \int_{-\infty}^{\infty} \left( \int_{-\infty}^{a-y} f(x, y)\, dx \right) dy.$$


In case X and Y are independent, the last double integral can be written as

$$\int_{-\infty}^{\infty} \left( \int_{-\infty}^{a-y} f_X(x)\, dx \right) f_Y(y)\, dy,$$

and we find that

$$F_Z(a) = \int_{-\infty}^{\infty} F_X(a - y)\, f_Y(y)\, dy$$

for −∞ < a < ∞. Differentiating F_Z we find the following rule.

Adding two independent continuous random variables.
Let X and Y be two independent continuous random variables, with
probability density functions f_X and f_Y. Then the probability
density function f_Z of Z = X + Y is given by

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy$$

for −∞ < z < ∞.
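To see the rule at work, one can approximate the integral numerically for the two independent U(0, 1) variables of the example. The sketch below uses a simple midpoint rule; the function names, the grid size n, and the evaluation points are our own assumptions. Differentiating the distribution function found above suggests the values z for 0 ≤ z ≤ 1 and 2 − z for 1 ≤ z ≤ 2, which the approximation should reproduce.

```python
def uniform_pdf(x):
    """Probability density of the U(0, 1) distribution."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolve_at(fx, fy, z, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of fx(z - y) * fy(y) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        total += fx(z - y) * fy(y)
    return h * total

def density_of_sum(z):
    # f_Y vanishes outside [0, 1], so integrating over [0, 1] suffices.
    return convolve_at(uniform_pdf, uniform_pdf, z, 0.0, 1.0)

for z in (0.25, 0.5, 1.0, 1.5):
    print(z, round(density_of_sum(z), 3))
```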

The  single-server  queue  revisited 

In the single-server queue model from Section 6.4, T_1 is the time between
the start at time zero and the arrival of the first customer and T_i is the
time between the arrival of the (i − 1)th and ith customer at a well. We are
interested in the arrival time of the nth customer at the well. For n ≥ 1, let
Z_n be the arrival time of the nth customer at the well: Z_n = T_1 + ··· + T_n.

Since each T_i has an Exp(0.5) distribution, it follows from the
linearity-of-expectations rule in Section 10.1 that the expected arrival time of the nth
customer is

$$\mathrm{E}[Z_n] = \mathrm{E}[T_1 + \cdots + T_n] = \mathrm{E}[T_1] + \cdots + \mathrm{E}[T_n] = 2n \text{ minutes}.$$

We would like to know whether the pump capacity is sufficient; for instance,
when the service times S_i are independent U(2, 5) distributed random
variables (this is the case when the pump capacity u = 1). In that case, at most
30 customers can pump water at the well in the first hour. If P(Z_30 ≤ 60) is
large, one might be tempted to increase the capacity of the well.

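Recalling that the T_i are independent Exp(λ) random variables, it follows
from the addition rule that f_{T_1+T_2}(z) = 0 if z < 0, and for z ≥ 0 that

$$f_{T_1+T_2}(z) = \int_{-\infty}^{\infty} f_{T_1}(z - y)\, f_{T_2}(y)\, dy = \int_0^z \lambda e^{-\lambda(z-y)} \cdot \lambda e^{-\lambda y}\, dy = \lambda^2 e^{-\lambda z} \int_0^z dy = \lambda^2 z\, e^{-\lambda z}.$$

Viewing T_1 + T_2 + T_3 as the sum of T_1 and T_2 + T_3, we find, by applying the
addition rule again, that f_{Z_3}(z) = 0 if z < 0, and for z ≥ 0 that

$$f_{Z_3}(z) = \int_{-\infty}^{\infty} f_{T_1}(z - y)\, f_{T_2+T_3}(y)\, dy = \int_0^z \lambda e^{-\lambda(z-y)} \cdot \lambda^2 y\, e^{-\lambda y}\, dy = \lambda^3 e^{-\lambda z} \int_0^z y\, dy = \tfrac12 \lambda^3 z^2 e^{-\lambda z}.$$

Repeating this procedure, we find that f_{Z_n}(z) = 0 if z < 0, and

$$f_{Z_n}(z) = \frac{\lambda (\lambda z)^{n-1} e^{-\lambda z}}{(n - 1)!}$$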

for z ≥ 0. Using integration by parts we find (see Exercise 11.13) that for
n ≥ 1 and a ≥ 0:

$$\mathrm{P}(Z_n \le a) = 1 - e^{-\lambda a} \sum_{i=0}^{n-1} \frac{(\lambda a)^i}{i!}.$$

Since λ = 1/2, it follows that

$$\mathrm{P}(Z_{30} \le 60) = 0.524.$$


Even  if  each  customer  fills  his  jerrican  in  the  minimum  time  of  2  minutes,  we 
see  that  after  an  hour  with  probability  0.524,  people  will  be  waiting  at  the 
pump! 
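The value 0.524 can be reproduced from the distribution function of the sum of n independent exponential arrival times, one minus e^(−λa) times the sum of (λa)^i / i! for i = 0, …, n − 1. The sketch below is a direct transcription of that formula; the function name `erlang_cdf` is our own.

```python
from math import exp, factorial

def erlang_cdf(a, n, lam):
    """P(Z_n <= a) for the sum Z_n of n independent Exp(lam) variables, a >= 0."""
    return 1.0 - exp(-lam * a) * sum((lam * a) ** i / factorial(i) for i in range(n))

# Probability that the 30th customer arrives within the first 60 minutes.
print(round(erlang_cdf(60, 30, 0.5), 3))
```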

The random variable Z_n is an example of a gamma random variable, defined
as follows.


Definition. A continuous random variable X has a gamma
distribution with parameters α > 0 and λ > 0 if its probability density
function f is given by f(x) = 0 for x < 0 and

$$f(x) = \frac{\lambda (\lambda x)^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)} \quad \text{for } x \ge 0,$$

where the quantity Γ(α) is a normalizing constant such that f
integrates to 1. We denote this distribution by Gam(α, λ).


The quantity Γ(α) is for α > 0 defined by

$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\, dt.$$

It satisfies for α > 0 and n = 1, 2, …

$$\Gamma(\alpha + 1) = \alpha\, \Gamma(\alpha) \quad \text{and} \quad \Gamma(n) = (n - 1)!$$

(see also Exercise 11.12). It follows from our example that the sum of n
independent Exp(λ) distributed random variables has a Gam(n, λ) distribution,
also known as the Erlang-n distribution with parameter λ.

The  sum  of  independent  normal  random  variables 

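Using the addition rule you can show that the sum of two independent
normally distributed random variables is again a normally distributed random
variable. For instance, if X and Y are independent N(0, 1) distributed random
variables, one has

$$f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy = \int_{-\infty}^{\infty} \left(\frac{1}{\sqrt{2\pi}}\right)^{2} e^{-\frac12 (2y^2 - 2yz + z^2)}\, dy.$$

To prepare a change of variables, we subtract the term ½z² from 2y² − 2yz + z²
to complete the square in the exponent:

$$2y^2 - 2yz + \tfrac12 z^2 = \left[\sqrt{2}\left(y - \tfrac{z}{2}\right)\right]^2.$$

In this way we find with changing integration variables t = √2 (y − z/2):

$$\begin{aligned}
f_{X+Y}(z) &= \frac{1}{\sqrt{2\pi}}\, e^{-\frac14 z^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 (2y^2 - 2yz + \frac12 z^2)}\, dy \\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac14 z^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 \left[\sqrt{2}(y - z/2)\right]^2}\, dy \\
&= \frac{1}{\sqrt{2\pi}}\, \frac{1}{\sqrt{2}}\, e^{-\frac14 z^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 t^2}\, dt \\
&= \frac{1}{\sqrt{4\pi}}\, e^{-\frac14 z^2} \int_{-\infty}^{\infty} \phi(t)\, dt.
\end{aligned}$$

Since φ is the probability density of the standard normal distribution, it
integrates to 1, so that

$$f_{X+Y}(z) = \frac{1}{\sqrt{4\pi}}\, e^{-\frac14 z^2},$$

which is the probability density of the N(0, 2) distribution. Thus, X + Y also
has a normal distribution. This is more generally true.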


The  sum  of  independent  normal  random  variables.  If  X  and 
Y  are  independent  random  variables  with  a  normal  distribution,  then 
X  +  Y  also  has  a  normal  distribution. 

Quick exercise 11.3 Let X and Y be independent random variables, where
X has an N(3, 16) distribution, and Y an N(5, 9) distribution. Then X + Y
is a normally distributed random variable. What are its parameters?

Rather surprisingly, independence of X and Y is not a prerequisite, as can be
seen in the following remark.

Remark 11.2 (Sums of dependent normal random variables). We
say the pair X, Y has a bivariate normal distribution if their joint
probability density equals

$$f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\!\left(-\frac12 \cdot \frac{1}{1 - \rho^2}\, Q(x, y)\right),$$

where

$$Q(x, y) = \left(\frac{x - \mu_X}{\sigma_X}\right)^{2} - 2\rho \left(\frac{x - \mu_X}{\sigma_X}\right)\!\left(\frac{y - \mu_Y}{\sigma_Y}\right) + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^{2}.$$

Here μ_X and μ_Y are the expectations of X and Y, σ_X² and σ_Y² are their
variances, and ρ is the correlation coefficient of X and Y. If X and Y have
such a bivariate normal distribution, then X has an N(μ_X, σ_X²) and Y has
an N(μ_Y, σ_Y²) distribution. Moreover, one can show that X + Y has an
N(μ_X + μ_Y, σ_X² + σ_Y² + 2ρσ_Xσ_Y) distribution. An example of a bivariate
normal probability density is displayed in Figure 9.2. This probability
density corresponds to parameters μ_X = μ_Y = 0, σ_X = σ_Y = 1/6, and ρ = 0.8.


11.3  Product  and  quotient  of  two  random  variables 

Recall from Chapter 7 the example of the architect who wants maximal
variety in the sizes of buildings. The architect wants more variety and therefore
replaces the square buildings by rectangular buildings: the buildings should
be of width X and depth Y, where X and Y are independent and uniformly
distributed between 0 and 10 meters. Since X and Y are independent, the
expected area of a building equals E[XY] = E[X]E[Y] = 5 · 5 = 25 m². But
what can one say about the distribution of the area Z = XY of an arbitrary
building?

Let  us  calculate  the  distribution  function  of  Z.  Clearly  Fz{a)  =  0  if  a  <  0 
and  Fz{a)  =  1  if  a  >  100.  For  a  between  0  and  100  we  can  compute  Fz{a) 
with  the  help  of  Figure  11.3. 

We find

$$\begin{aligned}
F_Z(a) = \mathrm{P}(Z \le a) = \mathrm{P}(XY \le a) &= \frac{\text{area of the shaded region in Figure 11.3}}{\text{area of } [0, 10] \times [0, 10]} \\
&= \frac{1}{100} \left( \frac{a}{10} \cdot 10 + \int_{a/10}^{10} \frac{a}{x}\, dx \right) = \frac{a\,(1 + 2 \ln 10 - \ln a)}{100}.
\end{aligned}$$

Fig. 11.3. The region G in the plane where xy ≤ a intersected with [0, 10] × [0, 10].

Hence the probability density function f_Z of Z is given by

$$f_Z(z) = \frac{d}{dz}\, \frac{z\,(1 + 2 \ln 10 - \ln z)}{100} = \frac{\ln 100 - \ln z}{100}$$

for 0 < z < 100 m².
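A quick simulation gives a sanity check on the distribution function of the area, F_Z(a) = a(1 + 2 ln 10 − ln a)/100 for 0 < a < 100. The sketch below is an illustration under our own assumptions: the seed, the sample size, and the evaluation points are arbitrary, and the function names are hypothetical.

```python
import random
from math import log

def F_Z(a):
    """Distribution function of Z = XY for independent U(0, 10) X and Y, 0 < a <= 100."""
    return a * (1 + 2 * log(10) - log(a)) / 100

random.seed(1)  # arbitrary seed, for reproducibility only
n = 200_000
areas = [random.uniform(0, 10) * random.uniform(0, 10) for _ in range(n)]

def empirical_cdf(a):
    """Fraction of simulated building areas that are at most a."""
    return sum(z <= a for z in areas) / n

for a in (10, 25, 50):
    print(a, round(F_Z(a), 4), round(empirical_cdf(a), 4))
```

The two columns should agree up to simulation noise, which is of the order of 1/√n.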

This  computation  can  be  generalized  to  arbitrary  independent  continuous 
random  variables,  and  we  obtain  the  following  formula  for  the  probability 
density  function  of  the  product  of  two  random  variables. 


Product of independent continuous random variables. Let
X and Y be two independent continuous random variables with
probability densities f_X and f_Y. Then the probability density function
f_Z of Z = XY is given by

$$f_Z(z) = \int_{-\infty}^{\infty} f_Y\!\left(\frac{z}{x}\right) f_X(x)\, \frac{1}{|x|}\, dx$$

for −∞ < z < ∞.


For the quotient Z = X/Y of two independent random variables X and
Y it is now fairly easy to derive the probability density function. Since the
independence of X and Y implies that X and 1/Y are independent, the
preceding rule yields

$$f_Z(z) = \int_{-\infty}^{\infty} f_{1/Y}\!\left(\frac{z}{x}\right) f_X(x)\, \frac{1}{|x|}\, dx.$$

Recall from Section 8.2 that the probability density function of 1/Y is given
by

$$f_{1/Y}(y) = \frac{1}{y^2}\, f_Y\!\left(\frac{1}{y}\right).$$

Substituting this in the integral, after changing the variable of integration, we
find the following rule.


Quotient of independent continuous random variables.
Let X and Y be two independent continuous random variables with
probability densities f_X and f_Y. Then the probability density
function f_Z of Z = X/Y is given by

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(zx)\, f_Y(x)\, |x|\, dx$$

for −∞ < z < ∞.


The  quotient  of  two  independent  normal  random  variables 


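Let X and Y be independent random variables, both having a standard normal
distribution. When we compute the quotient Z of X and Y, we find a so-called
standard Cauchy distribution:

$$\begin{aligned}
f_Z(z) &= \int_{-\infty}^{\infty} |x|\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 z^2 x^2}\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 x^2}\, dx = \frac{1}{2\pi} \int_{-\infty}^{\infty} |x|\, e^{-\frac12 (z^2 + 1) x^2}\, dx \\
&= 2 \cdot \frac{1}{2\pi} \int_0^{\infty} x\, e^{-\frac12 (z^2 + 1) x^2}\, dx = \frac{1}{\pi} \left[ \frac{-1}{z^2 + 1}\, e^{-\frac12 (z^2 + 1) x^2} \right]_0^{\infty} = \frac{1}{\pi (z^2 + 1)}.
\end{aligned}$$

This is the special case α = 0, β = 1 of the following family of distributions.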


Definition. A continuous random variable has a Cauchy distribution
with parameters $\alpha$ and $\beta > 0$ if its probability density function $f$
is given by

\[ f(x) = \frac{\beta}{\pi\left(\beta^2 + (x - \alpha)^2\right)} \quad \text{for } -\infty < x < \infty. \]

We denote this distribution by $Cau(\alpha, \beta)$.
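As a sanity check on the derivation of the standard Cauchy density, the quotient rule can be evaluated numerically for two standard normal densities and compared with the $Cau(0, 1)$ density. This is only an illustrative sketch: the function names, the truncation of the integral to [-10, 10] and the midpoint rule are assumptions made here, not part of the text.

```python
import math


def standard_normal_density(x):
    """Density of the N(0, 1) distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def quotient_density(z, f_x, f_y, lo=-10.0, hi=10.0, steps=40_000):
    """Midpoint-rule approximation of the integral of f_x(z x) f_y(x) |x| over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += f_x(z * x) * f_y(x) * abs(x)
    return total * h


def cauchy_density(x, alpha=0.0, beta=1.0):
    """Density of the Cau(alpha, beta) distribution as defined above."""
    return beta / (math.pi * (beta ** 2 + (x - alpha) ** 2))


if __name__ == "__main__":
    for z in (0.0, 0.5, 2.0):
        print(z, quotient_density(z, standard_normal_density, standard_normal_density),
              cauchy_density(z))
```

Truncating at ±10 is harmless in this case because the integrand decays like $e^{-x^2/2}$; it would not be a safe choice for heavy-tailed densities.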


By integrating, we find that the distribution function $F$ of a Cauchy distribution
is given by

\[ F(x) = \frac12 + \frac{1}{\pi}\arctan\left(\frac{x - \alpha}{\beta}\right). \]

The parameter $\alpha$ is the point of symmetry of the probability density function
$f$. Note that $\alpha$ is not the expected value of $Z$. As a matter of fact, it was
shown in Remark 7.1 that the expected value does not exist! The probability
density $f$ is shown together with the distribution function $F$ for the case
$\alpha = 2$, $\beta = 5$ in Figure 11.4.

Fig. 11.4. The graphs of $f$ and $F$ of the $Cau(2, 5)$ distribution.

Quick  exercise  11.4  Argue — without  doing  any  calculations — that  if  Z  has 
a  standard  Cauchy  distribution,  1/Z  also  has  a  standard  Cauchy  distribution. 
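The claim that the quotient of two independent standard normal random variables is standard Cauchy can also be probed by simulation. The sketch below compares an empirical distribution function of X/Y with the Cauchy distribution function $F(x) = \frac12 + \frac1\pi \arctan((x - \alpha)/\beta)$; the function names, sample size and seed are arbitrary choices made for illustration.

```python
import math
import random


def cauchy_cdf(x, alpha=0.0, beta=1.0):
    """Distribution function of the Cau(alpha, beta) distribution."""
    return 0.5 + math.atan((x - alpha) / beta) / math.pi


def empirical_cdf_of_ratio(x, n=100_000, seed=1):
    """Fraction of simulated quotients X/Y (X, Y independent N(0, 1)) that are at most x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        numerator = rng.gauss(0.0, 1.0)
        denominator = rng.gauss(0.0, 1.0)
        while denominator == 0.0:  # practically never happens, but avoids division by zero
            denominator = rng.gauss(0.0, 1.0)
        if numerator / denominator <= x:
            hits += 1
    return hits / n


if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.0, 3.0):
        print(x, empirical_cdf_of_ratio(x), cauchy_cdf(x))
```

Since the Cauchy distribution has no expectation, comparing distribution functions (or quantiles) is a more meaningful check than comparing sample means.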


11.4  Solutions  to  the  quick  exercises 


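11.1 Using the addition rule we find

\[
\begin{aligned}
P(S = 3) &= \sum_{j=1}^{6} p_X(3 - j)\,p_Y(j) \\
         &= p_X(2)p_Y(1) + p_X(1)p_Y(2) + p_X(0)p_Y(3) \\
         &\quad + p_X(-1)p_Y(4) + p_X(-2)p_Y(5) + p_X(-3)p_Y(6) \\
         &= \frac{1}{36} + \frac{1}{36} + 0 + 0 + 0 + 0 = \frac{1}{18}
\end{aligned}
\]

and

\[
\begin{aligned}
P(S = 8) &= \sum_{j=1}^{6} p_X(8 - j)\,p_Y(j) \\
         &= p_X(7)p_Y(1) + p_X(6)p_Y(2) + p_X(5)p_Y(3) \\
         &\quad + p_X(4)p_Y(4) + p_X(3)p_Y(5) + p_X(2)p_Y(6) \\
         &= 0 + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{5}{36}.
\end{aligned}
\]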

11.2 We have seen that $X_1 + X_2$ is a $Bin(n_1 + n_2, p)$ distributed random
variable. Viewing $X_1 + X_2 + X_3$ as the sum of $X_1 + X_2$ and $X_3$, it follows
that $X_1 + X_2 + X_3$ is a $Bin(n_1 + n_2 + n_3, p)$ distributed random variable.

11.3 The sum rule for two normal random variables tells us that $X + Y$ is
a normally distributed random variable. Its parameters are expectation and
variance of $X + Y$. Hence by linearity of expectations

\[ \mu_{X+Y} = E[X + Y] = E[X] + E[Y] = \mu_X + \mu_Y = 3 + 5 = 8, \]

and by the rule for the variance of the sum

\[ \sigma^2_{X+Y} = Var(X) + Var(Y) + 2\,Cov(X, Y) = \sigma_X^2 + \sigma_Y^2 = 16 + 9 = 25, \]

using that $Cov(X, Y) = 0$ due to independence of $X$ and $Y$.

11.4 In the examples we have seen that the quotient $X/Y$ of two independent
standard normal random variables has a standard Cauchy distribution. Since
$Z = X/Y$, the random variable $1/Z = Y/X$. This is also the quotient of two
independent standard normal random variables, and it has a standard Cauchy
distribution.


11.5 Exercises

11.1 □ Let $X$ and $Y$ be independent random variables with a discrete uniform
distribution, i.e., with probability mass functions

\[ p_X(k) = p_Y(k) = \frac{1}{N} \quad \text{for } k = 1, \ldots, N. \]

Use the addition rule for discrete random variables on page 152 to determine
the probability mass function of $Z = X + Y$ for the following two cases.

a. Suppose $N = 6$, so that $X$ and $Y$ represent two throws with a die. Show
that

\[
p_Z(k) = P(X + Y = k) =
\begin{cases}
\dfrac{k - 1}{36} & \text{for } k = 2, \ldots, 6, \\[1ex]
\dfrac{13 - k}{36} & \text{for } k = 7, \ldots, 12.
\end{cases}
\]

You may check this with Quick exercise 11.1.

b. Determine the expression for $p_Z(k)$ for general $N$.


11.2 ⊞ Consider a discrete random variable $X$ taking values $k = 0, 1, 2, \ldots$
with probabilities

\[ P(X = k) = \frac{\mu^k}{k!}\,e^{-\mu}, \]

where $\mu > 0$. This is the Poisson distribution with parameter $\mu$. We will learn
more about this distribution in Chapter 12. This exercise illustrates that the
sum of independent Poisson variables again has a Poisson distribution.

a. Let $X$ and $Y$ be independent random variables, each having a Poisson
distribution with $\mu = 1$. Show that for $k = 0, 1, 2, \ldots$

\[ P(X + Y = k) = \frac{2^k}{k!}\,e^{-2}, \]

by using $\sum_{\ell=0}^{k} \binom{k}{\ell} = 2^k$.

b. Let $X$ and $Y$ be independent random variables, having Poisson
distributions with parameters $\lambda$ and $\mu$, respectively. Show that for $k = 0, 1, 2, \ldots$

\[ P(X + Y = k) = \frac{(\lambda + \mu)^k}{k!}\,e^{-(\lambda + \mu)}, \]

by using $\sum_{\ell=0}^{k} \binom{k}{\ell} p^{\ell} (1 - p)^{k - \ell} = 1$ for $p = \mu/(\lambda + \mu)$.

We conclude that $X + Y$ has a Poisson distribution with parameter $\lambda + \mu$.

11.3 Let $X$ and $Y$ be two independent random variables, where $X$ has a
$Ber(p)$ distribution, and $Y$ has a $Ber(q)$ distribution. When $p = q = r$, we
know that $X + Y$ has a $Bin(2, r)$ distribution. Suppose that $p = 1/2$ and
$q = 1/4$. Determine $P(X + Y = k)$, for $k = 0, 1, 2$, and conclude that $X + Y$
does not have a binomial distribution.

11.4 ⊞ Let $X$ and $Y$ be two independent random variables, where $X$ has an
$N(2, 5)$ distribution and $Y$ has an $N(5, 9)$ distribution. Define $Z = 3X - 2Y + 1$.

a. Compute $E[Z]$ and $Var(Z)$.

b. What is the distribution of $Z$?

c. Compute $P(Z \le 6)$.

11.5 □ Let $X$ and $Y$ be two independent, $U(0, 1)$ distributed random variables.
Use the rule on addition of independent continuous random variables
on page 156 to show that the probability density function of $X + Y$ is given
by

\[
f_Z(z) =
\begin{cases}
z & \text{for } 0 \le z < 1, \\
2 - z & \text{for } 1 \le z \le 2, \\
0 & \text{otherwise.}
\end{cases}
\]

11.6 □ Let $X$ and $Y$ be independent random variables with probability densities

\[ f_X(x) = \tfrac14\, x\, e^{-x/2} \quad \text{and} \quad f_Y(y) = \tfrac14\, y\, e^{-y/2}. \]

Use the rule on addition of independent continuous random variables to
determine the probability density of $Z = X + Y$.

11.7 □ The two random variables in Exercise 11.6 are special cases of
$Gam(\alpha, \lambda)$ variables, namely with $\alpha = 2$ and $\lambda = 1/2$. More generally, let
$X_1, \ldots, X_n$ be independent $Gam(k, \lambda)$ distributed random variables, where
$\lambda > 0$ and $k$ is a positive integer. Argue, without doing any calculations,
that $X_1 + \cdots + X_n$ has a $Gam(nk, \lambda)$ distribution.

11.8 We investigate the effect on the Cauchy distribution under a change of
units.

a. Let $X$ have a standard Cauchy distribution. What is the distribution of
$Y = rX + s$?

b. Let $X$ have a $Cau(\alpha, \beta)$ distribution. What is the distribution of the
random variable $(X - \alpha)/\beta$?

11.9 ⊞ Let $X$ and $Y$ be independent random variables with a $Par(\alpha)$ and
$Par(\beta)$ distribution.

a. Take $\alpha = 3$ and $\beta = 1$ and determine the probability density of $Z = XY$.

b. Determine the probability density of $Z = XY$ for general $\alpha$ and $\beta$.

11.10 Let $X$ and $Y$ be independent random variables with a $Par(\alpha)$ and
$Par(\beta)$ distribution.

a. Take $\alpha = \beta = 2$. Show that $Z = X/Y$ has probability density

\[
f_Z(z) =
\begin{cases}
z & \text{for } 0 < z < 1, \\
1/z^3 & \text{for } 1 \le z < \infty.
\end{cases}
\]

b. For general $\alpha, \beta > 0$, show that $Z = X/Y$ has probability density

\[
f_Z(z) =
\begin{cases}
\dfrac{\alpha\beta}{\alpha + \beta}\, z^{\beta - 1} & \text{for } 0 < z < 1, \\[1ex]
\dfrac{\alpha\beta}{\alpha + \beta}\, \dfrac{1}{z^{\alpha + 1}} & \text{for } 1 \le z < \infty.
\end{cases}
\]


11.11 Let $X_1$, $X_2$, and $X_3$ be three independent $Geo(p)$ distributed random
variables, and let $Z = X_1 + X_2 + X_3$.

a. Show for $k \ge 3$ that the probability mass function $p_Z$ of $Z$ is given by

\[ p_Z(k) = P(X_1 + X_2 + X_3 = k) = \tfrac12 (k - 2)(k - 1)\, p^3 (1 - p)^{k - 3}. \]

b. Use the fact that $\sum_{k=3}^{\infty} p_Z(k) = 1$ to show that

\[ p^2 \left( E[X_1^2] + E[X_1] \right) = 2. \]

c. Use $E[X_1] = 1/p$ and part b to conclude that

\[ E[X_1^2] = \frac{2 - p}{p^2} \quad \text{and} \quad Var(X_1) = \frac{1 - p}{p^2}. \]

11.12 Show that $\Gamma(1) = 1$, and use integration by parts to show that

\[ \Gamma(x + 1) = x\,\Gamma(x) \quad \text{for } x > 0. \]

Use this last expression to show for $n = 1, 2, \ldots$ that

\[ \Gamma(n) = (n - 1)! \]

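11.13 Let $Z_n$ have an Erlang-$n$ distribution with parameter $\lambda$.

a. Use integration by parts to show that for $a \ge 0$ and $n \ge 2$:

\[ P(Z_n \le a) = \int_0^a \frac{\lambda^n z^{n-1} e^{-\lambda z}}{(n - 1)!}\,dz = -\frac{(\lambda a)^{n-1}}{(n - 1)!}\,e^{-\lambda a} + P(Z_{n-1} \le a). \]

b. Use a to show that for $a \ge 0$:

\[ P(Z_n \le a) = -\sum_{i=1}^{n-1} \frac{(\lambda a)^i}{i!}\,e^{-\lambda a} + P(Z_1 \le a). \]

c. Conclude that for $a \ge 0$:

\[ P(Z_n \le a) = 1 - e^{-\lambda a} \sum_{i=0}^{n-1} \frac{(\lambda a)^i}{i!}. \]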

12 


The  Poisson  process 


In  many  random  phenomena  we  encounter,  it  is  not  just  one  or  two  random 
variables  that  play  a  role  but  a  whole  collection.  In  that  case  one  often  speaks 
of  a  random  process.  The  Poisson  process  is  a  simple  kind  of  random  process, 
which  models  the  occurrence  of  random  points  in  time  or  space.  There  are 
numerous  ways  in  which  processes  of  random  points  arise:  some  examples  are 
presented  in  the  first  section.  The  Poisson  process  describes  in  a  certain  sense 
the  most  random  way  to  distribute  points  in  time  or  space.  This  is  made  more 
precise  with  the  notions  of  homogeneity  and  independence. 


12.1  Random  points 

Typical  examples  of  the  occurrence  of  random  time  points  are:  arrival  times 
of  email  messages  at  a  server,  the  times  at  which  asteroids  hit  the  earth, 
arrival  times  of  radioactive  particles  at  a  Geiger  counter,  times  at  which  your 
computer  crashes,  the  times  at  which  electronic  components  fail,  and  arrival 
times  of  people  at  a  pump  in  an  oasis. 

Examples  of  the  occurrence  of  random  points  in  space  are:  the  locations  of 
asteroid  impacts  with  earth  (2-dimensional),  the  locations  of  imperfections  in  a 
material  (3-dimensional),  and  the  locations  of  trees  in  a  forest  (2-dimensional). 

Some  of  these  phenomena  are  better  modeled  by  the  Poisson  process  than 
others.  Loosely  speaking,  one  might  say  that  the  Poisson  process  model  often 
applies  in  situations  where  there  is  a  very  large  population,  and  each  member 
of  the  population  has  a  very  small  probability  to  produce  a  point  of  the 
process.  This  is,  for  instance,  well  fulfilled  in  the  Geiger  counter  example 
where,  in  a  huge  collection  of  atoms,  just  a  few  will  emit  a  radioactive  particle 
(see  [28]).  A  property  of  the  Poisson  process — as  we  will  see  shortly — is  that 
points  may  lie  arbitrarily  close  together.  Therefore  the  tree  locations  are  not 
so well modeled by the Poisson process.


12.2 Taking a closer look at random arrivals

A  well-known  example  that  is  usually  modeled  by  the  Poisson  process  is  that 
of  calls  arriving  at  a  telephone  exchange — the  exchange  is  connected  to  a  large 
number  of  people  who  make  phone  calls  now  and  then.  This  will  be  our  leading 
example  in  this  section. 

Telephone calls arrive at random times $X_1, X_2, \ldots$ at the telephone exchange
during a time interval $[0, t]$.

[Figure: a time axis from 0 to $t$ with the arrival times $X_1, X_2, X_3, X_4, X_5$ marked.]


The two basic assumptions we make on these random arrivals are

1. (Homogeneity) The rate $\lambda$ at which arrivals occur is constant over time:
in a subinterval of length $u$ the expectation of the number of telephone
calls is $\lambda u$.

2. (Independence) The numbers of arrivals in disjoint time intervals are
independent random variables.

Homogeneity is also called weak stationarity. We denote the total number of
calls in an interval $I$ by $N(I)$, abbreviating $N([0, t])$ to $N_t$. Homogeneity then
implies that we require

\[ E[N_t] = \lambda t. \]

To get hold of the distribution of $N_t$ we divide the interval $[0, t]$ into $n$ intervals
of length $t/n$. When $n$ is large enough, every interval $I_{j,n} = ((j - 1)\,t/n,\; j\,t/n]$
will contain either 0 or 1 arrival.

[Figure: the time axis from 0 to $t$ divided into $n$ subintervals of length $t/n$, with the arrival times $X_1, \ldots, X_5$ marked.]

For such a large $n$ (which also satisfies $n > \lambda t$), let $R_j$ be the number of
arrivals in the time interval $I_{j,n}$. Since $R_j$ is 0 or 1, $R_j$ has a $Ber(p_j)$
distribution for some $p_j$. Recall that for a Bernoulli random variable
$E[R_j] = 0 \cdot (1 - p_j) + 1 \cdot p_j = p_j$. By the homogeneity assumption, for each $j$

\[ p_j = \lambda \cdot \text{length of } I_{j,n} = \frac{\lambda t}{n}. \]

Summing the number of calls in the intervals gives the total number of calls,
hence

\[ N_t = R_1 + R_2 + \cdots + R_n. \]

By the independence assumption, the $R_j$ are independent random variables,
therefore $N_t$ has a $Bin(n, p)$ distribution, with $p = \lambda t/n$.


Remark 12.1 (About this approximation). The argument just given
seems pretty convincing, but actually $R_j$ does not have a Bernoulli distribution,
whatever the value of $n$. A way to see this is the following. Every
interval $I_{j,n}$ is a union of the two intervals $I_{2j-1,2n}$ and $I_{2j,2n}$. Hence the
probability that $I_{j,n}$ contains two calls is at least $(\lambda t/2n)^2 = \lambda^2 t^2/4n^2$,
which is larger than zero.

Note however, that the probability of having two arrivals is of smaller order
than the probability that $R_j$ takes the value 1. If we add a third assumption,
namely that the probability of two or more calls arriving in an interval $I_{j,n}$
tends to zero faster than $1/n$, then the conclusion below on the distribution
of $N_t$ is valid.

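We have found that (at least in first approximation)

\[ P(N_t = k) = \binom{n}{k} \left(\frac{\lambda t}{n}\right)^{k} \left(1 - \frac{\lambda t}{n}\right)^{n-k} \quad \text{for } k = 0, \ldots, n. \]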


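In this analysis $n$ is a rather artificial parameter, of which we only know that
it should not be “too small.” It therefore seems a good idea to get rid of $n$
by letting $n$ go to infinity, hoping that the probability distribution of $N_t$ will
settle down. Note that

\[ \lim_{n\to\infty} \binom{n}{k} \frac{1}{n^k} = \lim_{n\to\infty} \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+1}{n} \cdot \frac{1}{k!} = \frac{1}{k!}, \]

and from calculus we know that

\[ \lim_{n\to\infty} \left(1 - \frac{\lambda t}{n}\right)^{n} = e^{-\lambda t}. \]

Since certainly

\[ \lim_{n\to\infty} \left(1 - \frac{\lambda t}{n}\right)^{-k} = 1, \]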

we obtain, combining these three limits, that

\[ \lim_{n\to\infty} P(N_t = k) = \lim_{n\to\infty} \binom{n}{k} \frac{1}{n^k} \cdot (\lambda t)^k \cdot \left(1 - \frac{\lambda t}{n}\right)^{n} \cdot \left(1 - \frac{\lambda t}{n}\right)^{-k} = \frac{(\lambda t)^k}{k!}\,e^{-\lambda t}. \]

Since

\[ e^{-\lambda t} \sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!} = e^{-\lambda t} \cdot e^{\lambda t} = 1, \]

we have indeed run into a probability distribution on the numbers 0, 1, 2, ... .
Note that all these probabilities are determined by the single value $\lambda t$. This
motivates the following definition.

Definition. A discrete random variable $X$ has a Poisson distribution
with parameter $\mu$, where $\mu > 0$, if its probability mass function $p$
is given by

\[ p(k) = P(X = k) = \frac{\mu^k}{k!}\,e^{-\mu} \quad \text{for } k = 0, 1, 2, \ldots. \]

We denote this distribution by $Pois(\mu)$.


Figure 12.1 displays the graphs of the probability mass functions of the Poisson
distribution with $\mu = 0.9$ (left) and the Poisson distribution with $\mu = 5$
(right).

Fig. 12.1. The probability mass functions of the $Pois(0.9)$ and the $Pois(5)$ distributions.
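The limit argument that leads from the $Bin(n, \lambda t/n)$ distribution to the Poisson distribution can be illustrated numerically. The sketch below is only an illustration under assumed names: it tabulates the largest difference between the two probability mass functions for growing $n$, with $\mu = \lambda t$ fixed.

```python
import math


def binomial_pmf(k, n, p):
    """P(N = k) for a Bin(n, p) random variable; zero when k > n."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, mu):
    """P(X = k) for a Pois(mu) random variable, computed on the log scale."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def max_gap(mu, n, k_max=20):
    """Largest absolute difference between the Bin(n, mu/n) and Pois(mu) pmf for k = 0..k_max."""
    return max(abs(binomial_pmf(k, n, mu / n) - poisson_pmf(k, mu))
               for k in range(k_max + 1))


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000):
        print(n, max_gap(5.0, n))
```

As in the text, $n$ should exceed $\mu = \lambda t$ so that $p = \mu/n$ is a valid success probability.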


Quick exercise 12.1 Consider the event “exactly one call arrives in the
interval $[0, 2s]$.” The probability of this event is $P(N_{2s} = 1) = \lambda \cdot 2s \cdot e^{-\lambda \cdot 2s}$.
But note that this event is the same as “there is exactly one call in the interval
$[0, s)$ and no calls in the interval $[s, 2s]$, or no calls in $[0, s)$ and exactly one call
in $[s, 2s]$.” Verify (using assumptions 1 and 2) that you get the same answer
if you compute the probability of the event in this way.

We do have a hint¹ about what the expectation and variance of a Poisson
random variable might be: since $E[N_t] = \lambda t$ for all $n$, we anticipate that the
limiting Poisson distribution will have expectation $\lambda t$. Similarly, since $N_t$ has
a $Bin(n, \lambda t/n)$ distribution, we anticipate that the variance will be

\[ \lim_{n\to\infty} Var(N_t) = \lim_{n\to\infty} n \cdot \frac{\lambda t}{n} \cdot \left(1 - \frac{\lambda t}{n}\right) = \lambda t. \]

¹ This is really not more than a hint: there are simple examples where the distributions
of random variables converge to a distribution whose expectation is different
from the limit of the expectations of the distributions! (cf. Exercise 12.14).


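Actually, the expectation of a Poisson random variable $X$ with parameter $\mu$
is easy to compute:

\[ E[X] = \sum_{k=0}^{\infty} k\,\frac{\mu^k}{k!}\,e^{-\mu} = \mu e^{-\mu} \sum_{k=1}^{\infty} \frac{\mu^{k-1}}{(k-1)!} = \mu e^{-\mu} \sum_{j=0}^{\infty} \frac{\mu^j}{j!} = \mu. \]

In a similar way the variance can be determined (see Exercise 12.8), and we
arrive at the following rule.


The expectation and variance of a Poisson distribution.
Let $X$ have a Poisson distribution with parameter $\mu$; then

\[ E[X] = \mu \quad \text{and} \quad Var(X) = \mu. \]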
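The rule can be confirmed numerically by summing the series for the first two moments. This is a small sketch with assumed names; the series is truncated at a point where the remaining Poisson probabilities are negligible for the values of $\mu$ used here.

```python
import math


def poisson_moments(mu, k_max=200):
    """Return (mean, variance) of Pois(mu) by summing k p(k) and k^2 p(k) for k = 0..k_max."""
    p = math.exp(-mu)  # p(0)
    mean = 0.0
    second_moment = 0.0
    for k in range(k_max + 1):
        mean += k * p
        second_moment += k * k * p
        p *= mu / (k + 1)  # recursion p(k + 1) = p(k) * mu / (k + 1)
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    for mu in (0.9, 5.0):
        print(mu, poisson_moments(mu))
```

The recursion $p(k+1) = p(k)\,\mu/(k+1)$ avoids computing large factorials directly.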


12.3  The  one-dimensional  Poisson  process 


We will derive some properties of the sequence of random points $X_1, X_2, \ldots$
that we considered in the previous section. What we derived so far is that for
any interval $(s, s + t]$ the number $N((s, s + t])$ of points $X_i$ in that interval is
a random variable with a $Pois(\lambda t)$ distribution.


Interarrival  times 


The differences

\[ T_i = X_i - X_{i-1} \]

are called interarrival times. Here we define $T_1 = X_1$, the time of the first
arrival. To determine the probability distribution of $T_1$, we observe that the
event $\{T_1 > t\}$ that the first call arrives after time $t$ is the same as the event
$\{N_t = 0\}$ that no calls have been made in $[0, t]$. But this implies that

\[ P(T_1 \le t) = 1 - P(T_1 > t) = 1 - P(N_t = 0) = 1 - e^{-\lambda t}. \]

Therefore $T_1$ has an exponential distribution with parameter $\lambda$.

To compute the joint distribution of $T_1$ and $T_2$, we consider the conditional
probability that $T_2 > t$, given that $T_1 = s$, and use the property that arrivals
in different intervals are independent:

\[
\begin{aligned}
P(T_2 > t \mid T_1 = s) &= P(\text{no arrivals in } (s, s + t] \mid T_1 = s) \\
                        &= P(\text{no arrivals in } (s, s + t]) \\
                        &= P(N((s, s + t]) = 0) = e^{-\lambda t}.
\end{aligned}
\]

Since this answer does not depend on $s$, we conclude that $T_1$ and $T_2$ are
independent, and

\[ P(T_2 > t) = e^{-\lambda t}, \]

i.e., $T_2$ also has an exponential distribution with parameter $\lambda$. Actually,
although the conclusion is correct, the method to derive it is not, because we
conditioned on the event $\{T_1 = s\}$, which has zero probability. This problem
could be circumvented by conditioning on the event that $T_1$ lies in some small
interval, but that will not be done here. Analogously, one can show that the $T_i$
are independent and have an $Exp(\lambda)$ distribution. This nice property allows
us to give a simple definition of the one-dimensional Poisson process.


Definition. The one-dimensional Poisson process with intensity $\lambda$
is a sequence $X_1, X_2, X_3, \ldots$ of random variables having the property
that the interarrival times $X_1, X_2 - X_1, X_3 - X_2, \ldots$ are independent
random variables, each with an $Exp(\lambda)$ distribution.


Note that the connection with $N_t$ is as follows: $N_t$ is equal to the number of
$X_i$ that are smaller than (or equal to) $t$.
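The definition translates directly into a simulation: add up independent $Exp(\lambda)$ interarrival times and keep the points that fall in $[0, t]$. The sketch below uses illustrative names and Python's standard `random.expovariate`; it also checks empirically that the average of $N_t$ is close to $\lambda t$.

```python
import random


def poisson_process_points(lam, t, rng):
    """Points X_1 < X_2 < ... in [0, t] of a one-dimensional Poisson process with intensity lam."""
    points = []
    x = 0.0
    while True:
        x += rng.expovariate(lam)  # next interarrival time, Exp(lam) distributed
        if x > t:
            return points
        points.append(x)


def average_count(lam, t, runs=20_000, seed=7):
    """Average of N_t, the number of points in [0, t], over independent simulated runs."""
    rng = random.Random(seed)
    return sum(len(poisson_process_points(lam, t, rng)) for _ in range(runs)) / runs


if __name__ == "__main__":
    print(average_count(2.0, 3.0))  # should be close to lam * t = 6
```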

Quick exercise 12.2 We model the arrivals of email messages at a server as
a Poisson process. Suppose that on average 330 messages arrive per minute.
What would you choose for the intensity $\lambda$ in messages per second? What is
the expectation of the interarrival time?

An obvious question is: what is the distribution of $X_i$? This has already been
answered in Chapter 11: since $X_i$ is a sum of $i$ independent exponentially
distributed random variables, we have the following.


The points of the Poisson process. For $i = 1, 2, \ldots$ the random
variable $X_i$ has a $Gam(i, \lambda)$ distribution.
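One way to make this concrete is through the events $\{X_i \le t\}$ and $\{N_t \ge i\}$, which coincide, so that $P(X_i \le t) = 1 - e^{-\lambda t}\sum_{j=0}^{i-1} (\lambda t)^j/j!$. The following sketch, with assumed function names, compares this expression with a simulation of $X_i$ as a sum of $i$ exponential interarrival times.

```python
import math
import random


def gamma_cdf_integer_shape(i, lam, t):
    """P(X_i <= t) for X_i ~ Gam(i, lam), written as P(N_t >= i) with N_t ~ Pois(lam * t)."""
    term = math.exp(-lam * t)  # P(N_t = 0)
    below_i = 0.0
    for j in range(i):
        below_i += term
        term *= lam * t / (j + 1)  # P(N_t = j + 1) from P(N_t = j)
    return 1.0 - below_i


def simulated_cdf(i, lam, t, runs=50_000, seed=3):
    """Fraction of runs in which the sum of i independent Exp(lam) variables is at most t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        if sum(rng.expovariate(lam) for _ in range(i)) <= t:
            hits += 1
    return hits / runs


if __name__ == "__main__":
    print(gamma_cdf_integer_shape(3, 2.0, 1.5), simulated_cdf(3, 2.0, 1.5))
```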


The  distribution  of  points 

Another interesting question is: if we know that $n$ points are generated in an
interval, where do these points lie? Since the distribution of the number of
points only depends on the length of the interval, and not on its location, it
suffices to determine this for an interval starting at 0. Let this interval be $[0, a]$.
We start with the simplest case, where there is one point in $[0, a]$: suppose
that $N([0, a]) = 1$. Then, for $0 < s < a$:

\[
\begin{aligned}
P(X_1 \le s \mid N([0, a]) = 1) &= \frac{P(X_1 \le s,\; N([0, a]) = 1)}{P(N([0, a]) = 1)} \\
&= \frac{P(N([0, s]) = 1,\; N((s, a]) = 0)}{P(N([0, a]) = 1)} \\
&= \frac{\lambda s\, e^{-\lambda s}\, e^{-\lambda (a - s)}}{\lambda a\, e^{-\lambda a}} = \frac{s}{a}.
\end{aligned}
\]

We find that conditional on the event $\{N([0, a]) = 1\}$, the random variable $X_1$
is uniformly distributed over the interval $[0, a]$.

Now suppose that it is given that there are two points in $[0, a]$: $N([0, a]) = 2$.
In a way similar to what we did for one point, we can show that (see
Exercise 12.12)

\[ P(X_1 \le s,\; X_2 \le t \mid N([0, a]) = 2) = \frac{t^2 - (t - s)^2}{a^2}. \]

Now recall the result of Exercise 9.17: if $U_1$ and $U_2$ are two independent
random variables, both uniformly distributed over $[0, a]$, then the joint distribution
function of $V = \min(U_1, U_2)$ and $Z = \max(U_1, U_2)$ is given by

\[ P(V \le s,\; Z \le t) = \frac{t^2 - (t - s)^2}{a^2} \quad \text{for } 0 \le s \le t \le a. \]

Thus we have found that, if we forget about their order, the two points in
$[0, a]$ are independent and uniformly distributed over $[0, a]$. With somewhat
more work, this generalizes to an arbitrary number of points, and we arrive
at the following property.


Location of the points, given their number. Given that
the Poisson process has $n$ points in the interval $[a, b]$, the locations
of these points are independently distributed, each with a uniform
distribution on $[a, b]$.
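The simplest case of this property, one point in $[0, a]$, can be explored by simulation: generate the process repeatedly, keep only the runs with exactly one point in $[0, a]$, and look at where that point lies. The sketch below uses assumed names and arbitrary parameter values; it illustrates the statement $P(X_1 \le s \mid N([0, a]) = 1) = s/a$ and is not a proof.

```python
import random


def points_in_interval(lam, a, rng):
    """Points in [0, a] of a Poisson process with intensity lam, built from Exp(lam) gaps."""
    points = []
    x = rng.expovariate(lam)
    while x <= a:
        points.append(x)
        x += rng.expovariate(lam)
    return points


def locations_given_one_point(lam, a, runs=40_000, seed=11):
    """Locations of the single point, collected over the runs with exactly one point in [0, a]."""
    rng = random.Random(seed)
    kept = []
    for _ in range(runs):
        points = points_in_interval(lam, a, rng)
        if len(points) == 1:
            kept.append(points[0])
    return kept


if __name__ == "__main__":
    a = 2.0
    kept = locations_given_one_point(1.0, a)
    for s in (0.5, 1.0, 1.5):
        fraction = sum(1 for x in kept if x <= s) / len(kept)
        print(s, fraction, s / a)
```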


12.4  Higher-dimensional  Poisson  processes 

Our definition of the one-dimensional Poisson process, starting with the
interarrival times, does not generalize easily, because it is based on the ordering
of the real numbers. However, we can easily extend the assumptions of
independence, homogeneity, and the Poisson distribution property. To do this we
need a higher-dimensional version of the concept of length. We denote the
$k$-dimensional volume of a set $A$ in $k$-dimensional space by $m(A)$. For instance,
in the plane $m(A)$ is the area of $A$, and in space $m(A)$ is the volume of $A$.

Definition. The $k$-dimensional Poisson process with intensity $\lambda$
is a collection $X_1, X_2, \ldots$ of random points having the property
that if $N(A)$ denotes the number of points in the set $A$, then

1. (Homogeneity) The random variable $N(A)$ has a Poisson distribution
with parameter $\lambda\, m(A)$.

2. (Independence) For disjoint sets $A_1, A_2, \ldots, A_n$ the random variables
$N(A_1), N(A_2), \ldots, N(A_n)$ are independent.

Quick  exercise  12.3  Suppose  that  the  locations  of  defects  in  a  certain  type  of 
material  follow  the  two-dimensional  Poisson  process  model.  For  this  material 
it  is  known  that  it  contains  on  average  five  defects  per  square  meter.  What  is 
the  probability  that  a  strip  of  length  2  meters  and  width  5  cm  will  be  without 
defects? 

In Figure 7.4 the locations of the buildings the architect wanted to distribute
over a 100-by-300-m terrain have been generated by a two-dimensional Poisson
process. This has been done in the following way. One can again show that
given the total number of points in a set, these points are uniformly distributed
over the set. This leads to the following procedure: first one generates a value
$n$ from a Poisson distribution with the appropriate parameter ($\lambda$ times the
area), then one generates $n$ times a point uniformly distributed over the
100-by-300 rectangle.
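The two-step procedure can be sketched as follows. The intensity value used in the example is hypothetical (the text does not state the one behind Figure 7.4), and the Poisson sampler is an assumption of this sketch: it counts the points of a one-dimensional Poisson process with intensity 1 in an interval of length $\mu$, which is simple but slow for large $\mu$.

```python
import random


def poisson_sample(mu, rng):
    """Draw from Pois(mu) by counting Exp(1) interarrival times that fit in [0, mu]."""
    count = 0
    total = rng.expovariate(1.0)
    while total <= mu:
        count += 1
        total += rng.expovariate(1.0)
    return count


def poisson_points_in_rectangle(lam, width, height, rng):
    """Two-dimensional Poisson process with intensity lam on [0, width] x [0, height]."""
    n = poisson_sample(lam * width * height, rng)  # step 1: number of points
    return [(rng.uniform(0.0, width), rng.uniform(0.0, height))  # step 2: uniform locations
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(3)
    # Hypothetical intensity of 0.001 points per square meter on a 100-by-300-m terrain.
    buildings = poisson_points_in_rectangle(0.001, 100.0, 300.0, rng)
    print(len(buildings), buildings[:3])
```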

Actually one can generate a higher-dimensional Poisson process in a way that
is very similar to the natural way this can be done for the one-dimensional
process. Directly from the definition of the one-dimensional process we see
that it can be obtained by consecutively generating points with exponentially
distributed gaps. We will explain a similar procedure for dimension two. For
$s > 0$, let

\[ M_s = N(C_s), \]

where $C_s$ is the circular region of radius $s$, centered at the origin. Since $C_s$
has area $\pi s^2$, $M_s$ has a Poisson distribution with parameter $\lambda \pi s^2$. Let $R_i$
denote the distance of the $i$th closest point to the origin. This is illustrated
in Figure 12.2.

Note that $R_i$ is the analogue of the $i$th arrival time for the one-dimensional
Poisson process: we have in fact that

\[ R_i \le s \quad \text{if and only if} \quad M_s \ge i. \]

In particular, with $i = 1$ and $s = \sqrt{t}$,

\[ P(R_1^2 \le t) = P(R_1 \le \sqrt{t}) = P(M_{\sqrt{t}} > 0) = 1 - e^{-\lambda \pi t}. \]

In other words: $R_1^2$ is $Exp(\lambda\pi)$ distributed. For general $i$, we can similarly
write

\[ P(R_i^2 \le t) = P(R_i \le \sqrt{t}) = P(M_{\sqrt{t}} \ge i). \]

Fig. 12.2. The Poisson process in the plane, with the two circles of the two points
closest to the origin.

So

\[ P(R_i^2 \le t) = 1 - e^{-\lambda \pi t} \sum_{j=0}^{i-1} \frac{(\lambda \pi t)^j}{j!}, \]

which means that $R_i^2$ has a $Gam(i, \lambda\pi)$ distribution, as we saw on page 157.
Since gamma distributions arise as sums of independent exponential distributions,
we can also write

\[ R_i^2 = R_{i-1}^2 + T_i, \]

where the $T_i$ are independent $Exp(\lambda\pi)$ random variables (and where $R_0 = 0$).
Note that this is quite similar to the one-dimensional case. To simulate the
two-dimensional Poisson process from a sequence $U_1, U_2, \ldots$ of independent
$U(0, 1)$ random variables, one can therefore proceed as follows (recall from
Section 6.2 that $-(1/\lambda)\ln(U_i)$ has an $Exp(\lambda)$ distribution): for $i = 1, 2, \ldots$
put

\[ R_i = \sqrt{R_{i-1}^2 - \frac{1}{\lambda\pi}\ln(U_{2i})}\,; \]

this gives the distance of the $i$th point to the origin, and then put the point
on this circle according to an angle value generated by $2\pi U_{2i-1}$. This is the
correct way to do it, because one can show that in polar coordinates the radius
and the angle of a Poisson process point are independent of each other, and
the angle is uniformly distributed over $[0, 2\pi]$. The latter is called the isotropy
property of the Poisson process.
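The radial procedure can be sketched in a few lines. The names below are illustrative, and the stopping rule (a maximal radius) is an assumption added here so that the loop terminates; the text itself describes an unbounded sequence of points. Each step uses one uniform number for the squared-radius increment and one for the angle.

```python
import math
import random


def radial_poisson_points(lam, max_radius, rng):
    """Points of a planar Poisson process with intensity lam inside the disc of radius max_radius.

    Squared distances to the origin grow by independent Exp(lam * pi) increments;
    angles are independent and uniform on [0, 2 * pi).
    """
    points = []
    r_squared = 0.0
    while True:
        u_angle = rng.random()       # plays the role of U_{2i-1}
        u_gap = 1.0 - rng.random()   # plays the role of U_{2i}; lies in (0, 1], so log is defined
        r_squared += -math.log(u_gap) / (lam * math.pi)
        if r_squared > max_radius ** 2:
            return points
        r = math.sqrt(r_squared)
        theta = 2.0 * math.pi * u_angle
        points.append((r * math.cos(theta), r * math.sin(theta)))


if __name__ == "__main__":
    rng = random.Random(5)
    pts = radial_poisson_points(1.0, 2.0, rng)
    print(len(pts))  # on average lam * pi * max_radius^2, about 12.6 here
```

The points come out ordered by their distance to the origin, which mirrors the ordering of arrival times in the one-dimensional case.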
12.5 Solutions to the quick exercises

12.1 The probability of exactly one call in $[0, s)$ and no calls in $[s, 2s]$ equals

\[
\begin{aligned}
P(N([0, s)) = 1,\; N([s, 2s]) = 0) &= P(N([0, s)) = 1)\,P(N([s, 2s]) = 0) \\
&= P(N([0, s)) = 1)\,P(N([0, s]) = 0) \\
&= \lambda s\, e^{-\lambda s} \cdot e^{-\lambda s},
\end{aligned}
\]

because of independence and homogeneity. In the same way, the probability
of exactly one call in $[s, 2s]$ and no calls in $[0, s)$ is equal to $e^{-\lambda s} \cdot \lambda s\, e^{-\lambda s}$. And
indeed: $\lambda s\, e^{-\lambda s} \cdot e^{-\lambda s} + e^{-\lambda s} \cdot \lambda s\, e^{-\lambda s} = 2\lambda s\, e^{-\lambda \cdot 2s}$.

12.2 Because there are 60 seconds in a minute, we have $60\lambda = 330$. It follows
that $\lambda = 5.5$. Since the interarrival times have an $Exp(\lambda)$ distribution, the
expected time between messages is $1/\lambda \approx 0.18$ second.

12.3 The intensity of this process is $\lambda = 5$ per m². The area of the strip is
$2 \cdot (1/20) = 1/10$ m². Hence the probability that no defects occur in the strip
is $e^{-\lambda \cdot (\text{area of strip})} = e^{-5 \cdot (1/10)} = e^{-1/2} \approx 0.61$.


12.6  Exercises 

12.1 ⊞ In each of the following examples, try to indicate whether the Poisson
process would be a good model.

a.  The  times  of  bankruptcy  of  enterprises  in  the  United  States. 

b.  The  times  a  chicken  lays  its  eggs. 

c.  The  times  of  airplane  crashes  in  a  worldwide  registration. 

d.  The  locations  of  worngly  spelled  words  in  a  book. 

e.  The  times  of  traffic  accidents  at  a  crossroad. 

12.2  The  number  of  customers  that  visit  a  bank  on  a  day  is  modeled  by  a 
Poisson  distribution.  It  is  known  that  the  probability  of  no  customers  at  all 
is  0.00001.  What  is  the  expected  number  of  customers? 

12.3 Let $N$ have a $Pois(4)$ distribution. What is $P(N = 4)$?

12.4 Let $X$ have a $Pois(2)$ distribution. What is $P(X \le 1)$?

12.5 □ The number of errors on a hard disk is modeled as a Poisson random
variable with expectation one error in every Mb, that is, in every $2^{20}$ bytes.

a. What is the probability of at least one error in a sector of 512 bytes?

b. The hard disk is an 18.62-Gb disk drive with 39 054 015 sectors. What is
the probability of at least one error on the hard disk?

12.6 ⊞ A certain brand of copper wire has flaws about every 40 centimeters.
Model the locations of the flaws as a Poisson process. What is the probability
of two flaws in 1 meter of wire?

12.7 ⊞ The Poisson model is sometimes used to study the flow of traffic ([15]).
If the traffic can flow freely, it behaves like a Poisson process. A 20-minute
time interval is divided into 10-second time slots. At a certain point along the
highway the number of passing cars is registered for each 10-second time slot.
Let $n_j$ be the number of slots in which $j$ cars have passed for $j = 0, \ldots, 9$.
Suppose that one finds

j      0   1   2   3   4   5   6   7   8   9
n_j   19  38  28  20   7   3   4   0   0   1

Note that the total number of cars passing in these 20 minutes is 230.

a. What would you choose for the intensity parameter $\lambda$?

b. Suppose one estimates the probability of 0 cars passing in a 10-second
time slot by $n_0$ divided by the total number of time slots. Does that
(reasonably) agree with the value that follows from your answer in a?

c. What would you take for the probability that 10 cars pass in a 10-second
time slot?

12.8  □  Let  A  be  a  Poisson  random  variable  with  parameter  /i. 

a.  Compute  E[A(A  —  1)]. 

b.  Compute  Var(A),  using  that  Var(A)  =  E[A(A  —  1)]  -|-  E[A]  —  (E[A])^. 

12.9  Let  Yi  and  Y2  be  independent  Poisson  random  variables  with  parameter 
/ri,  respectively  fi2-  Show  that  Y  =  Yi  +Y2  also  has  a  Poisson  distribution. 
Instead  of  using  the  addition  rule  in  Section  11.1  as  in  Exercise  11.2,  you 
can  prove  this  without  doing  any  computations  by  considering  the  number 
of  points  of  a  Poisson  process  (with  intensity  1)  in  two  disjoint  intervals  of 
length  /ii  and  /i2- 

12.10 Let $X$ be a random variable with a $Pois(\mu)$ distribution. Show the
following. If $\mu < 1$, then the probabilities $\mathrm{P}(X = k)$ are strictly decreasing
in $k$. If $\mu > 1$, then the probabilities $\mathrm{P}(X = k)$ are first increasing, then
decreasing (cf. Figure 12.1). What happens if $\mu = 1$?

12.11 ⊞ Consider the one-dimensional Poisson process with intensity $\lambda$. Show
that the number of points in $[0, t]$, given that the number of points in $[0, 2t]$
is equal to $n$, has a $Bin(n, \frac{1}{2})$ distribution.

Hint: write the event $\{N([0, t]) = k, N([0, 2t]) = n\}$ as the intersection of the
(independent!) events $\{N([0, t]) = k\}$ and $\{N((t, 2t]) = n - k\}$.
12.12 We consider the one-dimensional Poisson process. Suppose for some
$a > 0$ it is given that there are exactly two points in $[0, a]$, or in other words:
$N_a = 2$. The goal of this exercise is to determine the joint distribution of $X_1$
and $X_2$, the locations of the two points, conditional on $N_a = 2$.

a. Prove that for $0 < s < t < a$

\[ \mathrm{P}(X_1 \le s, X_2 \le t, N_a = 2) = \mathrm{P}(X_2 \le t, N_a = 2) - \mathrm{P}(X_1 > s, X_2 \le t, N_a = 2). \]

b. Deduce from a that

\[ \mathrm{P}(X_1 \le s, X_2 \le t, N_a = 2) = e^{-\lambda a} \left( \frac{\lambda^2 t^2}{2} - \frac{\lambda^2 (t - s)^2}{2} \right). \]

c. Deduce from b that for $0 < s < t < a$

\[ \mathrm{P}(X_1 \le s, X_2 \le t \mid N_a = 2) = \frac{t^2 - (t - s)^2}{a^2}. \]

12.13 Walking through a meadow we encounter two kinds of flowers, daisies
and dandelions. As we walk in a straight line, we model the positions of the
flowers we encounter with a one-dimensional Poisson process with intensity $\lambda$.
It appears that about one in every four flowers is a daisy. Forgetting about
the dandelions, what does the process of the daisies look like? This question
will be answered with the following steps.

a. Let $N_t$ be the total number of flowers, $X_t$ the number of daisies, and $Y_t$
the number of dandelions we encounter during the first $t$ minutes of
our walk. Note that $X_t + Y_t = N_t$. Suppose that each flower is a daisy
with probability 1/4, independent of the other flowers. Argue that

\[ \mathrm{P}(X_t = n, Y_t = m \mid N_t = n + m) = \binom{n + m}{n} \left(\frac{1}{4}\right)^n \left(\frac{3}{4}\right)^m. \]

b. Show that

\[ \mathrm{P}(X_t = n, Y_t = m) = \frac{1}{n!} \frac{1}{m!} \left(\frac{1}{4}\right)^n \left(\frac{3}{4}\right)^m e^{-\lambda t} (\lambda t)^{n + m}, \]

by conditioning on $N_t$ and using a.

c. By writing $e^{-\lambda t} = e^{-(\lambda/4)t} e^{-(3\lambda/4)t}$ and summing over $m$, show that

\[ \mathrm{P}(X_t = n) = \frac{1}{n!} e^{-(\lambda/4)t} \left( (\lambda/4) t \right)^n. \]

Since it is clear that the numbers of daisies that we encounter in disjoint time
intervals are independent, we may conclude from c that the process $(X_t)$ is
again a Poisson process, with intensity $\lambda/4$. One often says that the process
$(X_t)$ is obtained by thinning the process $(N_t)$. In our example this corresponds
to picking all the dandelions.
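The thinning described in this exercise can be sketched with a small simulation. The sketch below is only an illustration under assumptions that are not part of the exercise: the intensity 4, the walking time 5, the number of repetitions, and all function names are hypothetical choices. It uses the fact that the gaps between successive points of a Poisson process with intensity $\lambda$ are independent $Exp(\lambda)$ random variables. If the daisy counts are $Pois(\lambda t/4)$, their mean and variance should both be close to $\lambda t/4$.

```python
import random
import statistics


def count_daisies(intensity, t, p_daisy, rng):
    """Simulate the flowers met on [0, t] and count those marked as daisies."""
    daisies = 0
    position = rng.expovariate(intensity)  # location of the first flower
    while position <= t:
        if rng.random() < p_daisy:  # each flower is a daisy, independently
            daisies += 1
        position += rng.expovariate(intensity)  # exponential gap to the next flower
    return daisies


def daisy_mean_and_variance(intensity=4.0, t=5.0, p_daisy=0.25,
                            repetitions=5000, seed=1):
    """Return the sample mean and variance of the daisy count over many walks."""
    rng = random.Random(seed)
    counts = [count_daisies(intensity, t, p_daisy, rng) for _ in range(repetitions)]
    return statistics.mean(counts), statistics.pvariance(counts)


mean, variance = daisy_mean_and_variance()
# Both numbers should be close to intensity * t / 4 = 5 (illustrative values).
print(mean, variance)
```

Equal mean and variance is only a necessary property of a Poisson distribution, so this check supports the claim in part c without replacing the derivation.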


12.14 □ In this exercise we look at a simple example of random variables $X_n$
that have the property that their distributions converge to the distribution of
a random variable $X$ as $n \to \infty$, while it is not true that their expectations
converge to the expectation of $X$. Let for $n = 1, 2, \ldots$ the random variables
$X_n$ be defined by

\[ \mathrm{P}(X_n = 0) = 1 - \frac{1}{n} \quad \text{and} \quad \mathrm{P}(X_n = 7n) = \frac{1}{n}. \]

a. Let $X$ be the random variable that is equal to 0 with probability 1. Show
that for all $a$ the probability mass functions $p_{X_n}(a)$ of the $X_n$ converge to
the probability mass function $p_X(a)$ of $X$ as $n \to \infty$. Note that $\mathrm{E}[X] = 0$.

b. Show that nonetheless $\mathrm{E}[X_n] = 7$ for all $n$.
13 The law of large numbers


For  many  experiments  and  observations  concerning  natural  phenomena — such 
as  measuring  the  speed  of  light — one  finds  that  performing  the  procedure  twice 
under (what seem) identical conditions results in two different outcomes.
Uncontrollable factors cause “random” variation. In practice one tries to
overcome this as follows: the experiment is repeated a number of times and the
results  are  averaged  in  some  way.  In  this  chapter  we  will  see  why  this  works  so 
well,  using  a  model  for  repeated  measurements.  We  view  them  as  a  sequence 
of  independent  random  variables,  each  with  the  same  unknown  distribution. 
It  is  a  probabilistic  fact  that  from  such  a  sequence — in  principle — any  feature 
of  the  distribution  can  be  recovered.  This  is  a  consequence  of  the  law  of  large 
numbers. 


13.1  Averages  vary  less 

Scientists and engineers involved in experimental work have known for
centuries that more accurate answers are obtained when measurements or
experiments are repeated a number of times and one averages the individual
outcomes.¹ For example, if you read a description of A.A. Michelson’s work
done in 1879 to determine the speed of light, you would find that for each
value he collected, repeated measurements at several levels were performed.
In an article in Statistical Science describing his work ([18]), R.J. MacKay
and R.W. Oldford state: “It is clear that Michelson appreciated the power
of averaging to reduce variability in measurement.” We shall see that we can
understand this reduction using only what we have learned so far about
probability in combination with a simple inequality called Chebyshev’s inequality.
Throughout this chapter we consider a sequence of random variables $X_1, X_2,
X_3, \ldots$. You should think of $X_i$ as the result of the $i$th repetition of a
particular measurement or experiment. We confine ourselves to the situation where
experimental conditions of subsequent experiments are identical, and the
outcome of any one experiment does not influence the outcomes of others. Under
those circumstances, the random variables of the sequence are independent,
and all have the same distribution, and we therefore call $X_1, X_2, X_3, \ldots$ an
independent and identically distributed sequence. We shall denote the
distribution function of each random variable $X_i$ by $F$, its expectation by $\mu$, and
the standard deviation by $\sigma$.

¹ We leave the problem of systematic errors aside but will return to it in Chapter 19.

The average of the first $n$ random variables in the sequence is

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}, \]

and using linearity of expectations we find:

\[ \mathrm{E}[\bar{X}_n] = \frac{1}{n} \mathrm{E}[X_1 + X_2 + \cdots + X_n] = \frac{1}{n} (\mu + \mu + \cdots + \mu) = \mu. \]

By the variance-of-the-sum rule, using the independence of $X_1, \ldots, X_n$,

\[ \mathrm{Var}(\bar{X}_n) = \frac{1}{n^2} \mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \frac{1}{n^2} (\sigma^2 + \sigma^2 + \cdots + \sigma^2) = \frac{\sigma^2}{n}. \]

This  establishes  the  following  rule. 


Expectation and variance of an average. If $\bar{X}_n$ is the average
of $n$ independent random variables with the same expectation $\mu$ and
variance $\sigma^2$, then

\[ \mathrm{E}[\bar{X}_n] = \mu \quad \text{and} \quad \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}. \]


The expectation of $\bar{X}_n$ is again $\mu$, and its standard deviation is less than that
of a single $X_i$ by a factor $\sqrt{n}$; the “typical distance” from $\mu$ is $\sqrt{n}$ smaller.
The latter property is what Michelson used to gain accuracy. To illustrate
this, we analyze an example.

Suppose the random variables $X_1, X_2, \ldots$ are continuous with a $Gam(2, 1)$
distribution, so with probability density:

\[ f(x) = x e^{-x} \quad \text{for } x \ge 0. \]

Recall from Section 11.2 that this means that each $X_i$ is distributed as the
sum of two independent $Exp(1)$ random variables. Hence, $S_n = X_1 + \cdots + X_n$
is distributed as the sum of $2n$ independent $Exp(1)$ random variables, which
has a $Gam(2n, 1)$ distribution, with probability density

\[ f_{S_n}(x) = \frac{x^{2n-1} e^{-x}}{(2n - 1)!} \quad \text{for } x \ge 0. \]

Because $\bar{X}_n = S_n/n$, we find by applying the change-of-units rule (page 106):

\[ f_{\bar{X}_n}(x) = n f_{S_n}(nx) = \frac{n (nx)^{2n-1} e^{-nx}}{(2n - 1)!} \quad \text{for } x \ge 0. \]

This is the probability density of the $Gam(2n, n)$ distribution.

So we have determined the distribution of $\bar{X}_n$ explicitly and we can investigate
what happens as $n$ increases, for example, by plotting probability densities.
In the left-hand column of Figure 13.1 you see plots of $f_{\bar{X}_n}$ for $n$ = 1, 2, 4, 9,
16, and 400 (note that for $n = 1$ this is just $f$ itself). For comparison, we take
as a second example a so-called bimodal density function: a density with two
bumps, formally called modes. For the same values of $n$ we determined the
probability density function of $\bar{X}_n$ (unlike the previous example, we are not
concerned with the computations, just with the results). The graphs of these
densities are given side by side with the gamma densities in Figure 13.1.

The graphs clearly show that, as $n$ increases, there is “contraction” of the
probability mass near the expected value $\mu$ (for the gamma densities this is 2,
for the bimodal densities 2.625).
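The contraction can also be checked numerically. The following is a minimal sketch, not the computation behind Figure 13.1: it simulates many averages of $n$ independent $Gam(2, 1)$ realizations and compares their sample mean and sample variance with $\mu = 2$ and $\sigma^2/n = 2/n$ (each $X_i$ is the sum of two independent $Exp(1)$ variables, so $\sigma^2 = 2$). The function names, the seed, and the number of repetitions are illustrative assumptions.

```python
import random
import statistics


def average_of_gamma(n, rng):
    """Return the average of n independent Gam(2, 1) realizations."""
    return sum(rng.gammavariate(2.0, 1.0) for _ in range(n)) / n


def mean_and_variance_of_averages(n, repetitions=10000, seed=0):
    """Simulate many averages and return their sample mean and variance."""
    rng = random.Random(seed)
    averages = [average_of_gamma(n, rng) for _ in range(repetitions)]
    return statistics.mean(averages), statistics.pvariance(averages)


for n in (1, 4, 16):
    mean, variance = mean_and_variance_of_averages(n)
    # Theory: expectation 2 and variance 2 / n.
    print(n, round(mean, 3), round(variance, 3), "theory:", 2.0, 2.0 / n)
```

The simulated variance should shrink roughly like $1/n$ while the mean stays near 2, which is the numerical counterpart of the narrowing densities.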

Quick exercise 13.1 Compare the probabilities that $\bar{X}_n$ is within 0.5 of its
expected value for $n$ = 1, 4, 16, and 400. Do this for the gamma case only
by estimating the probabilities from the graphs in the left-hand column of
Figure 13.1.

13.2  Chebyshev’s  inequality 

The contraction of probability mass near the expectation is a consequence of
the fact that, for any probability distribution, most probability mass is within
a few standard deviations from the expectation. To show this we will employ
the following tool, which provides a bound for the probability that the random
variable $Y$ is outside the interval $(\mathrm{E}[Y] - a, \mathrm{E}[Y] + a)$.

Chebyshev’s inequality. For an arbitrary random variable $Y$
and any $a > 0$:

\[ \mathrm{P}(|Y - \mathrm{E}[Y]| \ge a) \le \frac{1}{a^2} \mathrm{Var}(Y). \]


We shall derive this inequality for continuous $Y$ (the discrete case is similar).
Let $f_Y$ be the probability density function of $Y$. Let $\mu$ denote $\mathrm{E}[Y]$. Then:

\[ \mathrm{Var}(Y) = \int_{-\infty}^{\infty} (y - \mu)^2 f_Y(y)\,dy \ge \int_{|y - \mu| \ge a} (y - \mu)^2 f_Y(y)\,dy \ge \int_{|y - \mu| \ge a} a^2 f_Y(y)\,dy = a^2\, \mathrm{P}(|Y - \mu| \ge a). \]

Dividing both sides of the resulting inequality by $a^2$, we obtain Chebyshev’s
inequality.

Fig. 13.1. Densities of averages. Left column: from a gamma density; right column:
from a bimodal density.

Denote $\mathrm{Var}(Y)$ by $\sigma^2$ and consider the probability that $Y$ is within a few
standard deviations from its expectation $\mu$:

\[ \mathrm{P}(|Y - \mu| < k\sigma) = 1 - \mathrm{P}(|Y - \mu| \ge k\sigma), \]

where $k$ is a small integer. Setting $a = k\sigma$ in Chebyshev’s inequality, we find

\[ \mathrm{P}(|Y - \mu| < k\sigma) \ge 1 - \frac{\mathrm{Var}(Y)}{k^2 \sigma^2} = 1 - \frac{1}{k^2}. \qquad (13.1) \]

For $k = 2, 3, 4$ the right-hand side is 3/4, 8/9, and 15/16, respectively. This
suggests that with Chebyshev’s inequality we can make very strong
statements. For most distributions, however, the actual value of $\mathrm{P}(|Y - \mu| < k\sigma)$
is even higher than the lower bound (13.1). We summarize this as a somewhat
loose rule.


The “$\mu \pm$ a few $\sigma$” rule. Most of the probability mass of a
random variable is within a few standard deviations from its
expectation.
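To see how loose the lower bound (13.1) can be, the sketch below compares it with exact probabilities for the $Gam(2, 1)$ distribution used earlier in this chapter. It relies on two facts that follow from the density $f(x) = x e^{-x}$ but are not stated in the text: the distribution function is $1 - (1 + x)e^{-x}$ for $x \ge 0$ (integrate the density by parts), and the variance is 2. The function names are illustrative.

```python
import math

MU = 2.0                  # expectation of Gam(2, 1)
SIGMA = math.sqrt(2.0)    # standard deviation of Gam(2, 1)


def gamma21_cdf(x):
    """Distribution function of Gam(2, 1): 1 - (1 + x) e^{-x} for x >= 0."""
    if x <= 0.0:
        return 0.0
    return 1.0 - (1.0 + x) * math.exp(-x)


def prob_within(k):
    """Exact P(|Y - mu| < k sigma) for Y with a Gam(2, 1) distribution."""
    return gamma21_cdf(MU + k * SIGMA) - gamma21_cdf(MU - k * SIGMA)


def chebyshev_lower_bound(k):
    """Right-hand side of (13.1)."""
    return 1.0 - 1.0 / k ** 2


for k in (1, 2, 3, 4):
    print(k, round(chebyshev_lower_bound(k), 3), round(prob_within(k), 3))
```

For every $k$ the exact probability is at least as large as the bound, in line with the remark that Chebyshev’s inequality holds for all distributions and is therefore rarely sharp for a particular one.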

Quick exercise 13.2 Calculate $\mathrm{P}(|Y - \mu| < k\sigma)$ exactly for $k = 1, 2, 3, 4$
when $Y$ has an $Exp(1)$ distribution and compare this with the bounds from
Chebyshev’s inequality.


13.3  The  law  of  large  numbers 

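We return to the independent and identically distributed sequence of
random variables $X_1, X_2, \ldots$ with expectation $\mu$ and variance $\sigma^2$. We apply
Chebyshev’s inequality to the average $\bar{X}_n$, where we use $\mathrm{E}[\bar{X}_n] = \mu$ and
$\mathrm{Var}(\bar{X}_n) = \sigma^2/n$, and where $\varepsilon > 0$:

\[ \mathrm{P}(|\bar{X}_n - \mu| > \varepsilon) = \mathrm{P}(|\bar{X}_n - \mathrm{E}[\bar{X}_n]| > \varepsilon) \le \frac{1}{\varepsilon^2} \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n \varepsilon^2}. \]

The right-hand side vanishes as $n$ goes to infinity, no matter how small $\varepsilon$ is.
This proves the following law.


The law of large numbers. If $\bar{X}_n$ is the average of $n$ independent
random variables with expectation $\mu$ and variance $\sigma^2$, then for any
$\varepsilon > 0$:

\[ \lim_{n \to \infty} \mathrm{P}(|\bar{X}_n - \mu| > \varepsilon) = 0. \]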
A connection with experimental work

Let us try to interpret the law of large numbers from an experimenter’s
perspective. Imagine you conduct a series of experiments. The experimental setup
is  complicated  and  your  measurements  vary  quite  a  bit  around  the  “true”  value 
you  are  after.  Suppose  (unknown  to  you)  your  measurements  have  a  gamma 
distribution,  and  its  expectation  is  what  you  want  to  determine.  You  decide 
to  do  a  certain  number  of  measurements,  say  n,  and  to  use  their  average  as 
your  estimate  of  the  expectation. 

We can simulate all this, and Figure 13.2 shows the results of a simulation,
where we chose the same $Gam(2, 1)$ distribution, i.e., with expectation $\mu = 2$.
We anticipated that you might want to do as many as 500 measurements, so
we generated realizations for $X_1, X_2, \ldots, X_{500}$. For each $n$ we computed the
average of the first $n$ values and plotted these averages against $n$ in Figure 13.2.


Fig. 13.2. Averages of realizations of a sequence of gamma distributed random
variables.


If  your  decision  is  to  do  200  repetitions,  you  would  find  (in  this  simulation)  a 
value  of  about  2.09  (slightly  too  high,  but  you  wouldn’t  know!),  whereas  with 
n  =  400  you  would  be  almost  exactly  correct  with  1.99,  and  with  n  =  500 
again  a  little  farther  away  with  2.06.  For  another  sequence  of  realizations,  the 
details  in  the  pattern  that  you  see  in  Figure  13.2  would  be  different,  but  the 
general  dampening  of  the  oscillations  would  still  be  present.  This  follows  from 
what  we  saw  earlier,  that  as  n  is  larger,  the  probability  for  the  average  to  be 
within  a  certain  distance  of  the  expectation  increases,  in  the  limit  even  to  1. 
In  practice  it  may  happen  that  with  a  large  number  of  repetitions  your  average 
is  farther  from  the  “true”  value  than  with  a  smaller  number  of  repetitions — if 
it  is,  then  you  had  bad  luck,  because  the  odds  are  in  your  favor. 
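A simulation of this kind is easy to repeat. The sketch below is not the code that produced Figure 13.2; the seed and the function name are arbitrary assumptions, so the printed averages will differ from the values 2.09, 1.99, and 2.06 quoted above. It generates 500 realizations from a $Gam(2, 1)$ distribution and computes the average of the first $n$ values for every $n$.

```python
import random
from itertools import accumulate


def running_averages(n_max=500, seed=13):
    """Return the averages of the first n realizations for n = 1, ..., n_max."""
    rng = random.Random(seed)
    realizations = [rng.gammavariate(2.0, 1.0) for _ in range(n_max)]
    partial_sums = accumulate(realizations)  # X_1, X_1 + X_2, ...
    return [s / n for n, s in enumerate(partial_sums, start=1)]


averages = running_averages()
for n in (200, 400, 500):
    print(n, round(averages[n - 1], 2))
```

Changing the seed gives another sequence of realizations: the details of the path change, but the averages for large $n$ should again settle near 2.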
The averages may fail to converge

The law of large numbers is valid if the expectation of the distribution $F$ is
finite. This is not always the case. For example, the Cauchy and some Pareto
distributions have heavy tails: their probability densities do go to 0 as $x$
becomes large, but (too) slowly.² On the left in Figure 13.3 you see the result
of a simulation with $Cau(2, 1)$ random variables. As in the gamma case, the
averages tend to go toward 2 (which is the point of symmetry of the $Cau(2, 1)$
density), but once in a while a very large (positive or negative) realization of
an $X_i$ throws off the average.


Fig. 13.3. Averages of realizations of a sequence of Cauchy (at left) and Pareto (at
right) distributed random variables.


On the right in Figure 13.3 the result of a simulation with a $Par(0.99)$
distribution is shown. Its expectation is infinite. In the plot we see segments where
the average “drifts downward,” separated by upward jumps, which correspond
to $X_i$ with extremely large values. The effect of the jumps dominates: it can
be shown that $\bar{X}_n$ grows beyond any level.

You  might  think  that  these  patterns  are  phenomena  that  occur  because  of 
the  short  length  of  the  simulation  and  that  in  longer  simulations  they  would 
disappear after some value of $n$. However, the patterns as described will
continue to occur and the results of a longer simulation, say to $n = 5000$, would
not look any “better.”
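That longer simulations do not help in the Cauchy case can be illustrated as follows. The sketch assumes the standard fact, not derived in this text, that the average of $n$ independent $Cau(2, 1)$ random variables again has a $Cau(2, 1)$ distribution, so the probability that the average lies within distance 1 of the point of symmetry 2 equals 1/2 for every $n$. Realizations are generated by the inverse distribution function $2 + \tan(\pi(u - \frac{1}{2}))$; the function names, seed, and repetition count are illustrative.

```python
import math
import random


def cauchy_2_1(rng):
    """One Cau(2, 1) realization via the inverse distribution function."""
    return 2.0 + math.tan(math.pi * (rng.random() - 0.5))


def fraction_within_one(n, repetitions=4000, seed=7):
    """Fraction of simulated averages of n Cauchy values within 1 of 2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        average = sum(cauchy_2_1(rng) for _ in range(n)) / n
        if abs(average - 2.0) < 1.0:
            hits += 1
    return hits / repetitions


for n in (1, 10, 100):
    # Under the stated assumption this fraction stays near 1/2 for every n.
    print(n, fraction_within_one(n))
```

For a distribution with finite variance the same fraction would increase toward 1 as $n$ grows; here averaging brings no contraction at all.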

² They represent two separate cases: the Cauchy expectation does not exist (see
Remark 7.1) and the $Par(\alpha)$’s expectation is $+\infty$ if $\alpha \le 1$ (see Section 7.2).

Remark 13.1 (There is a stronger law of large numbers). Even
though it is a strong statement, the law of large numbers in this paragraph
is more accurately known as the weak law of large numbers. A stronger
result holds, the strong law of large numbers, which says that:

\[ \mathrm{P}\left( \lim_{n \to \infty} \bar{X}_n = \mu \right) = 1. \]

This is also expressed as “as $n$ goes to infinity, $\bar{X}_n$ converges to $\mu$ with
probability 1.” It is not easy to see, but it is true that the strong law is
actually stronger. The conditions for the law of large numbers, as stated
in this section, could be relaxed. They suffice for both versions of the law.
The conditions can be weakened to a point where the weak law still follows
from them, but the strong law does not anymore; the strong law requires
the stronger conditions.


13.4  Consequences  of  the  law  of  large  numbers 

We continue with the sequence $X_1, X_2, \ldots$ of independent random variables
with distribution function $F$. In the previous section we saw how we could
recover the (unknown) expectation $\mu$ from a realization of the sequence. We
shall see that in fact we can recover any feature of the probability
distribution. In order to avoid unnecessary indices, as in $\mathrm{E}[X_1]$ and $\mathrm{P}(X_1 \in C)$, we
introduce an additional random variable $X$ that also has $F$ as its distribution
function.

Recovering  the  probability  of  an  event 

Suppose that, rather than being interested in $\mu = \mathrm{E}[X]$, we want to know the
probability of an event, for example,

\[ p = \mathrm{P}(X \in C), \quad \text{where } C = (a, b] \text{ for some } a < b. \]

If you do not know this probability $p$, you would probably estimate it from
how often the event $\{X_i \in C\}$ occurs in the sequence. You would use the
relative frequency of $X_i \in C$ among $X_1, \ldots, X_n$: the number of times the
set $C$ was hit divided by $n$. Define for each $i$:

\[ Y_i = \begin{cases} 1 & \text{if } X_i \in C, \\ 0 & \text{if } X_i \notin C. \end{cases} \]


The random variable $Y_i$ indicates whether the corresponding $X_i$ hits the set $C$;
it is called an indicator random variable. In general, an indicator random
variable for an event $A$ is a random variable that is 1 when $A$ occurs and 0
when $A^c$ occurs. Using this terminology, $Y_i$ is the indicator random variable
of the event $X_i \in C$. Its expectation is given by

\[ \mathrm{E}[Y_i] = 1 \cdot \mathrm{P}(X_i \in C) + 0 \cdot \mathrm{P}(X_i \notin C) = \mathrm{P}(X_i \in C) = \mathrm{P}(X \in C) = p. \]

Using the $Y_i$, the relative frequency is expressed as $(Y_1 + Y_2 + \cdots + Y_n)/n = \bar{Y}_n$.

Note that the random variables $Y_1, Y_2, \ldots$ are independent; the $X_i$ form an
independent sequence, and $Y_i$ is determined from $X_i$ only (this is an application
of the rule about propagation of independence; see page 126).
The law of large numbers, with $p$ in the role of $\mu$, can now be applied to $\bar{Y}_n$:
it is the average of $n$ independent random variables with expectation $p$ and
variance $p(1 - p)$, so

\[ \lim_{n \to \infty} \mathrm{P}(|\bar{Y}_n - p| > \varepsilon) = 0 \qquad (13.2) \]

for any $\varepsilon > 0$. By reasoning along the same lines as in the previous section, we
see that from a long sequence of realizations we can get an accurate estimate
of the probability $p$.
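The estimate of $p$ by a relative frequency can be sketched in a few lines. The example below is an illustration under assumptions: it takes the $Gam(2, 1)$ distribution of the earlier sections, the set $C = (1, 3]$, and hypothetical function names. The exact value used for comparison follows from the distribution function $1 - (1 + x)e^{-x}$ of $Gam(2, 1)$, obtained by integrating the density $x e^{-x}$.

```python
import math
import random


def relative_frequency(a, b, n, seed=3):
    """Average of the indicators Y_i of the events X_i in (a, b]."""
    rng = random.Random(seed)
    indicators = [1 if a < rng.gammavariate(2.0, 1.0) <= b else 0
                  for _ in range(n)]
    return sum(indicators) / n


def exact_probability(a, b):
    """P(a < X <= b) for X with a Gam(2, 1) distribution, 0 <= a < b."""
    def cdf(x):
        return 1.0 - (1.0 + x) * math.exp(-x)
    return cdf(b) - cdf(a)


for n in (100, 1000, 10000):
    print(n, relative_frequency(1.0, 3.0, n), round(exact_probability(1.0, 3.0), 4))
```

As $n$ grows the relative frequency should move closer to the exact probability, which is what (13.2) expresses.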

Recovering  the  probability  density  function 

Consider the continuous case, where $f$ is the probability density function
corresponding with $F$, and now choose $C = (a - h, a + h]$, for some (small)
positive $h$. By equation (13.2), for large $n$:

\[ \bar{Y}_n \approx p = \mathrm{P}(X \in C) = \int_{a-h}^{a+h} f(x)\,dx \approx 2h f(a). \qquad (13.3) \]

This relationship suggests estimating the probability density at $a$ as follows:

\[ f(a) \approx \frac{\bar{Y}_n}{2h} = \frac{\text{the number of times } X_i \in C \text{ for } i \le n}{n \cdot \text{the length of } C}. \]

In Figure 13.4 we have done so for $h = 0.25$ and two values of $a$: 2 and 4.
Rather than plotting the estimate in just one point, we use the same value
for the whole interval $(a - h, a + h]$. This results in a vertical bar, whose area
corresponds to

\[ \text{height} \cdot \text{width} = \frac{\bar{Y}_n}{2h} \cdot 2h = \bar{Y}_n. \]

These estimates are based on the realizations of 500 independent $Gam(2, 1)$
distributed random variables. In order to be able to see how well things came
out, the $Gam(2, 1)$ density function is shown as well; near $a = 2$ the estimate
is very accurate, but around $a = 4$ it is a little too low.

Fig. 13.4. Estimating the density at two points.

There  really  is  no  reason  to  derive  estimated  values  around  just  a  few  points, 
as  is  done  in  Figure  13.4.  We  might  as  well  cover  the  whole  x-axis  with  a  grid 
(with  grid  size  2h)  and  do  the  computation  for  each  point  in  the  grid,  thus 
covering  the  axis  with  a  series  of  bars.  The  resulting  bar  graph  is  called  a 
histogram.  Figure  13.5  shows  the  result  for  two  sets  of  realizations. 
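The construction of such a histogram can be sketched as follows. This is a minimal illustration, not the code behind Figures 13.4 and 13.5: the seed, the range of the grid, and the function names are assumptions. Each bar height is the estimate $\bar{Y}_n/2h$ for the interval $(a - h, a + h]$ around a grid point $a$, and the $Gam(2, 1)$ density $x e^{-x}$ is printed next to it for comparison.

```python
import math
import random


def density_estimate(realizations, a, h):
    """Estimate f(a) by the fraction of realizations in (a - h, a + h], divided by 2h."""
    hits = sum(1 for x in realizations if a - h < x <= a + h)
    return hits / (2.0 * h * len(realizations))


def histogram_heights(realizations, h=0.25, x_max=10.0):
    """Return (center, height) pairs for grid points 0, 2h, 4h, ..., x_max."""
    number_of_bars = int(round(x_max / (2.0 * h))) + 1
    centers = [2.0 * h * j for j in range(number_of_bars)]
    return [(a, density_estimate(realizations, a, h)) for a in centers]


rng = random.Random(5)
sample = [rng.gammavariate(2.0, 1.0) for _ in range(500)]
for a, height in histogram_heights(sample):
    # Columns: grid point, histogram height, true density x e^{-x}.
    print(f"{a:4.1f}  {height:.3f}  {a * math.exp(-a):.3f}")
```

Because the intervals around neighboring grid points do not overlap, the areas of the bars add up to the fraction of realizations that fall inside the grid, which is at most 1.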


Fig.  13.5.  Recovering  the  density  function  by  way  of  histograms. 


The  top  graph  is  constructed  from  the  same  realizations  as  Figure  13.4  and 
the  bottom  graph  is  constructed  from  a  new  set  of  realizations.  Both  graphs 
match  the  general  shape  of  the  density,  with  some  bumps  and  valleys  that  are 
particular  for  the  corresponding  set  of  realizations.  In  Chapters  15  and  17  we 
shall  return  to  histograms  and  treat  them  more  elaborately. 

Quick  exercise  13.3  The  height  of  the  bar  at  x  =  2  in  the  first  histogram 
is  0.26.  How  many  of  the  500  realizations  were  between  1.75  and  2.25? 
13.5 Solutions to the quick exercises

13.1 The answers you have found should be in the neighborhood of the
following exact values:

$n$                                  1      4      16     400
$\mathrm{P}(|\bar{X}_n - \mu| < 0.5)$    0.27   0.52   0.85   1.00

13.2 Because $Y$ has an $Exp(1)$ distribution, $\mu = 1$ and $\mathrm{Var}(Y) = \sigma^2 = 1$; we
find for $k \ge 1$:

\[ \mathrm{P}(|Y - \mu| < k\sigma) = \mathrm{P}(|Y - 1| < k) = \mathrm{P}(1 - k < Y < k + 1) = \mathrm{P}(Y < k + 1) = 1 - e^{-k-1}. \]

Using this formula and (13.1) we obtain the following numbers:

$k$                            1       2       3       4
Lower bound from Chebyshev     0       0.750   0.889   0.938
$\mathrm{P}(|Y - 1| < k)$              0.865   0.950   0.982   0.993

13.3 The value of $\bar{Y}_n$ for this bar equals its area $0.26 \cdot 0.5 = 0.13$. The bar
represents 13% of the values, or $0.13 \cdot 500 = 65$ realizations.


13.6  Exercises 

13.1 Verify the “$\mu \pm$ a few $\sigma$” rule as you did in Quick exercise 13.2 for the
following distributions: $U(-1, 1)$, $U(-a, a)$, $N(0, 1)$, $N(\mu, \sigma^2)$, $Par(3)$, $Geo(1/2)$.
Construct a table as in the answer to the quick exercise and enter a line for
each distribution.

13.2 ⊞ An accountant wants to simplify his bookkeeping by rounding amounts
to the nearest integer, for example, rounding €99.53 and €100.46 both to
€100. What is the cumulative effect of this if there are, say, 100 amounts? To
study this we model the rounding errors by 100 independent $U(-0.5, 0.5)$
random variables $X_1, X_2, \ldots, X_{100}$.

a. Compute the expectation and the variance of the $X_i$.

b. Use Chebyshev’s inequality to compute an upper bound for the probability
$\mathrm{P}(|X_1 + X_2 + \cdots + X_{100}| > 10)$ that the cumulative rounding error
$X_1 + X_2 + \cdots + X_{100}$ exceeds €10.
13.3 Consider the situation of the previous exercise. A manager wants to
know what happens to the mean absolute error $\frac{1}{n} \sum_{i=1}^{n} |X_i|$ as $n$ becomes
large. What can you say about this, applying the law of large numbers?

13.4 ⊞ Of the voters in Florida, a proportion $p$ will vote for candidate G,
and a proportion $1 - p$ will vote for candidate B. In an election poll a number
of voters are asked for whom they will vote. Let $X_i$ be the indicator random
variable for the event “the $i$th person interviewed will vote for G.” A model
for the election poll is that the people to be interviewed are selected in such
a way that the indicator random variables $X_1, X_2, \ldots$ are independent and
have a $Ber(p)$ distribution.

a. Suppose we use $\bar{X}_n$ to predict $p$. According to Chebyshev’s inequality, how
large should $n$ be (how many people should be interviewed) such that the
probability that $\bar{X}_n$ is within 0.2 of the “true” $p$ is at least 0.9?

Hint: solve this first for $p = 1/2$, and use that $p(1 - p) \le 1/4$ for all
$0 \le p \le 1$.

b. Answer the same question, but now $\bar{X}_n$ should be within 0.1 of $p$.

c. Answer the question from part a, but now the probability should be at
least 0.95.

d. If $p > 1/2$ candidate G wins; if $\bar{X}_n > 1/2$ you predict that G will win.
Find an $n$ (as small as you can) such that the probability that you predict
correctly is at least 0.9, if in fact $p = 0.6$.

13.5 You are trying to determine the melting point of a new material, of
which you have a large number of samples. For each sample that you measure
you find a value close to the actual melting point $c$ but corrupted with a
measurement error. We model this with random variables:

\[ M_i = c + U_i, \]

where $M_i$ is the measured value in degrees Kelvin, and $U_i$ is the occurring
random error. It is known that $\mathrm{E}[U_i] = 0$ and $\mathrm{Var}(U_i) = 3$, for each $i$, and that
we may consider the random variables $M_1, M_2, \ldots$ independent. According
to Chebyshev’s inequality, how many samples do you need to measure to be
90% sure that the average of the measurements is within half a degree of $c$?

13.6  □  The  casino  La  bella  Fortuna  is  for  sale  and  you  think  you  might  want 
to  buy  it,  but  you  want  to  know  how  much  money  you  are  going  to  make.  All 
the  present  owner  can  tell  you  is  that  the  roulette  game  Red  or  Black  is  played 
about  1000  times  a  night,  365  days  a  year.  Each  time  it  is  played  you  have 
probability 19/37 of winning the player’s bet of €1 and probability 18/37 of
having to pay the player €1.

Explain  in  detail  why  the  law  of  large  numbers  can  be  used  to  determine  the 
income  of  the  casino,  and  determine  how  much  it  is. 
13.7 Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed
random variables with distribution function $F$. Define $F_n$ as follows: for any $a$

\[ F_n(a) = \frac{\text{number of } X_i \text{ in } (-\infty, a]}{n}. \]

Consider $a$ fixed and introduce the appropriate indicator random variables (as
in Section 13.4). Compute their expectation and variance and show that the
law of large numbers tells us that

\[ \lim_{n \to \infty} \mathrm{P}(|F_n(a) - F(a)| > \varepsilon) = 0. \]

13.8 □ In Section 13.4 we described how the probability density function
could be recovered from a sequence $X_1, X_2, X_3, \ldots$. We consider the
$Gam(2, 1)$ probability density discussed in the main text and a histogram bar
around the point $a = 2$. Then $f(a) = f(2) = 2e^{-2} = 0.27$ and the estimate
for $f(2)$ is $\bar{Y}_n/2h$, where $\bar{Y}_n$ is as in (13.3).

a. Express the standard deviation of $\bar{Y}_n/2h$ in terms of $n$ and $h$.

b. Choose $h = 0.25$. How large should $n$ be (according to Chebyshev’s
inequality) so that the estimate is within 20% of the “true value”, with
probability 80%?


13.9 ⊞ Let $X_1, X_2, \ldots$ be an independent sequence of $U(-1, 1)$ random
variables and let $T_n = \frac{1}{n} \sum_{i=1}^{n} X_i^2$. It is claimed that for some $a$ and any
$\varepsilon > 0$

\[ \lim_{n \to \infty} \mathrm{P}(|T_n - a| > \varepsilon) = 0. \]


a.  Explain  how  this  could  be  true. 

b.  Determine  a. 


13.10 □ Let $M_n$ be the maximum of $n$ independent $U(0, 1)$ random variables.

a. Derive the exact expression for $\mathrm{P}(|M_n - 1| > \varepsilon)$.

Hint: see Section 8.4.

b. Show that $\lim_{n \to \infty} \mathrm{P}(|M_n - 1| > \varepsilon) = 0$. Can this be derived from
Chebyshev’s inequality or the law of large numbers?

13.11 For some $t > 1$, let $X$ be a random variable taking the values 0 and $t$,
with probabilities

\[ \mathrm{P}(X = 0) = 1 - \frac{1}{t} \quad \text{and} \quad \mathrm{P}(X = t) = \frac{1}{t}. \]

Then $\mathrm{E}[X] = 1$ and $\mathrm{Var}(X) = t - 1$. Consider the probability $\mathrm{P}(|X - 1| > a)$.

a. Verify the following: if $t = 10$ and $a = 8$ then $\mathrm{P}(|X - 1| > a) = 1/10$ and
Chebyshev’s inequality gives an upper bound for this probability of 9/64.
The difference is $9/64 - 1/10 \approx 0.04$. We will say that for $t = 10$ the
Chebyshev gap for $X$ at $a = 8$ is 0.04.

b. Compute the Chebyshev gap for $t = 10$ at $a = 5$ and at $a = 10$.

c.  Can  you  find  a  gap  smaller  than  0.01,  smaller  than  0.001,  smaller  than 
0.0001? 

d.  Do  you  think  one  could  improve  Chebyshev’s  inequality,  i.e.,  find  an  upper 
bound  closer  to  the  true  probabilities? 

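13.12 (A more general law of large numbers). Let $X_1, X_2, \ldots$ be a
sequence of independent random variables, with $\mathrm{E}[X_i] = \mu_i$ and $\mathrm{Var}(X_i) = \sigma_i^2$,
for $i = 1, 2, \ldots$. Suppose that $0 < \sigma_i^2 \le M$, for all $i$. Let $a$ be an arbitrary
positive number.

a. Apply Chebyshev’s inequality to show that

\[ \mathrm{P}\left( \left| \bar{X}_n - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right| > a \right) \le \frac{\mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)}{n^2 a^2}. \]

b. Conclude from a that

\[ \lim_{n \to \infty} \mathrm{P}\left( \left| \bar{X}_n - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right| > a \right) = 0. \]

Check that the law of large numbers is a special case of this result.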
14 The central limit theorem


The central limit theorem is a refinement of the law of large numbers.
For a large number of independent identically distributed random variables
$X_1, \ldots, X_n$, with finite variance, the average $\bar{X}_n$ approximately has a normal
distribution, no matter what the distribution of the $X_i$ is. In the first section
we discuss the proper normalization of $\bar{X}_n$ to obtain a normal distribution
in the limit. In the second section we will use the central limit theorem to
approximate probabilities of averages and sums of random variables.


14.1  Standardizing  averages 

In the previous chapter we saw that the law of large numbers guarantees
the convergence to $\mu$ of the average $\bar{X}_n$ of $n$ independent random variables
$X_1, \ldots, X_n$, all having the same expectation $\mu$ and variance $\sigma^2$. This
convergence was illustrated by Figure 13.1. Closer examination of this figure suggests
another phenomenon: for the two distributions considered (i.e., the $Gam(2, 1)$
distribution and a bimodal distribution), the probability density function of
$\bar{X}_n$ seems to become symmetrical and bell shaped around the expected value $\mu$
as $n$ becomes larger and larger. However, the bell collapses into a single spike
at $\mu$. Nevertheless, by a proper normalization it is possible to stabilize the
bell shape, as we will see.

In  order  to  let  the  distribution  of  Xn  settle  down  it  seems  to  be  a  good  idea 
to  stabilize  the  expectation  and  variance.  Since  E  [.^n]  =  M  for  n,  only  the 
variance  needs  some  special  attention.  In  Figure  14.1  we  depict  the  probability 
density  function  of  the  centered  average  /x  of  Gam(2, 1)  random  variables, 
multiplied  by  three  different  powers  of  n.  In  the  left  column  we  display  the 
density  of  nJ  (X„  —  ^),  in  the  middle  column  the  density  of  (X„  —  /x),  and 
in  the  right  column  the  density  of  n(X„  —  fi).  These  figures  suggest  that  yGi 
is  the  right  factor  to  stabilize  the  bell  shape. 
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Fig.  14.1.  Multiplying  the  difference  Xn  — /r  of  n  Gam{2, 1)  random  variables.  Left 
column:  n*  (X„  —  fj,);  middle  column:  yfn{XrL  —  n)\  right  column:  n{Xn  —  fj,). 
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Indeed,  according  to  the  rule  for  the  variance  of  an  average  (see  page  182), 
we  have  Var(X„)  =  cr^/n,  and  therefore  for  any  number  C: 

2 

Var(C(X„  -  n))  =  Var(C7X„)  =  C'2Var(X„)  = 

To  stabilize  the  variance  we  therefore  must  choose  C  =  y/n.  In  fact,  by  choos¬ 
ing  C  =  one  standardizes  the  averages,  i.e.,  the  resulting  random  vari¬ 

able  Zn,  defined  by 

Zn  =  y'n - ,  n  =  l,2, ..., 

a 

has  expected  value  0  and  variance  1.  What  more  can  we  say  about  the  distri¬ 
bution  of  the  random  variables  Znl 
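The effect of the three scaling factors can also be checked numerically. The following is a minimal simulation sketch, not the procedure used to draw Figure 14.1: the function names, the number of repetitions, and the seed are illustrative assumptions, and the standard-library generator `random.gammavariate(2.0, 1.0)` is assumed to correspond to the $Gam(2,1)$ distribution (expectation 2, variance 2). By the rule above, the variance of $n^{p}(\bar{X}_n - \mu)$ should be close to $2n^{2p-1}$, so it should shrink for $p = 1/4$, stay near 2 for $p = 1/2$, and grow for $p = 1$.

```python
import random
from statistics import pvariance

MU = 2.0  # expectation of Gam(2, 1); its variance is also 2


def centered_averages(n, reps=1000, seed=0):
    """Simulate `reps` realizations of (average of n Gam(2,1) draws) - MU."""
    rng = random.Random(seed)
    return [
        sum(rng.gammavariate(2.0, 1.0) for _ in range(n)) / n - MU
        for _ in range(reps)
    ]


def variance_after_scaling(n, power, reps=1000, seed=0):
    """Empirical variance of n**power * (average - MU)."""
    c = n ** power
    return pvariance([c * d for d in centered_averages(n, reps, seed)])


# Columns: power 1/4 (collapses), power 1/2 (stabilizes), power 1 (blows up).
for n in (10, 100, 400):
    print(n, [round(variance_after_scaling(n, p), 3) for p in (0.25, 0.5, 1.0)])
```

Only the middle column of the printed output should remain roughly constant as $n$ grows, which is the numerical counterpart of choosing $C = \sqrt{n}$.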

In case $X_1, X_2, \ldots$ are independent $N(\mu, \sigma^2)$ distributed random
variables, we know from Section 11.2 and the rule on expectation and variance
under change of units (see page 98), that $Z_n$ has an $N(0,1)$ distribution for
all $n$. For the gamma and bimodal random variables from Section 13.1 we depicted
the probability density function of $Z_n$ in Figure 14.2. For both examples we
see that the probability density functions of the $Z_n$ seem to converge to the
probability density function of the $N(0,1)$ distribution, indicated by the
dotted line. The following amazing result states that this behavior generally
occurs no matter what distribution we start with.


The central limit theorem. Let $X_1, X_2, \ldots$ be any sequence
of independent identically distributed random variables with finite
positive variance. Let $\mu$ be the expected value and $\sigma^2$ the variance
of each of the $X_i$. For $n \ge 1$, let $Z_n$ be defined by

$$Z_n = \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma};$$

then for any number $a$

$$\lim_{n \to \infty} F_{Z_n}(a) = \Phi(a),$$

where $\Phi$ is the distribution function of the $N(0,1)$ distribution. In
words: the distribution function of $Z_n$ converges to the distribution
function $\Phi$ of the standard normal distribution.


Note that

$$Z_n = \frac{\bar{X}_n - E[\bar{X}_n]}{\sqrt{\mathrm{Var}(\bar{X}_n)}},$$

which is a more direct way to see that $Z_n$ is the average $\bar{X}_n$ standardized.

Fig. 14.2. Densities of standardized averages $Z_n$. Left column: from a gamma
density; right column: from a bimodal density. Dotted line: $N(0,1)$ probability
density.

One can also write $Z_n$ as a standardized sum

$$Z_n = \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}. \qquad (14.1)$$

In the next section we will see that this last representation of $Z_n$ is very
helpful when one wants to approximate probabilities of sums of independent
identically distributed random variables.

Since

$$\bar{X}_n = \frac{\sigma}{\sqrt{n}}\,Z_n + \mu,$$

it follows that $\bar{X}_n$ approximately has an $N(\mu, \sigma^2/n)$ distribution;
see the change-of-units rule for normal random variables on page 106. This
explains the symmetrical bell shape of the probability densities in Figure 13.1.
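The statement of the theorem can be illustrated by comparing an empirical distribution function of simulated $Z_n$ with $\Phi$. The sketch below is only an illustration under stated assumptions: the helper names, $n = 50$, the number of repetitions, and the seed are arbitrary choices, and `random.gammavariate(2.0, 1.0)` is assumed to match the $Gam(2,1)$ distribution with $\mu = 2$ and $\sigma^2 = 2$.

```python
import random
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 2.0, sqrt(2.0)  # expectation and standard deviation of Gam(2, 1)


def simulate_z(n, reps=4000, seed=1):
    """Simulate `reps` realizations of Z_n = sqrt(n) * (average - MU) / SIGMA."""
    rng = random.Random(seed)
    zs = []
    for _ in range(reps):
        avg = sum(rng.gammavariate(2.0, 1.0) for _ in range(n)) / n
        zs.append(sqrt(n) * (avg - MU) / SIGMA)
    return zs


def empirical_cdf(sample, a):
    """Fraction of the sample that is at most a."""
    return sum(z <= a for z in sample) / len(sample)


zs = simulate_z(n=50)
for a in (-1.0, 0.0, 1.0):
    # Empirical F_{Z_n}(a) next to the standard normal value Phi(a).
    print(a, round(empirical_cdf(zs, a), 3), round(NormalDist().cdf(a), 3))
```

The two printed columns should be close but not identical; the remaining difference reflects both simulation noise and the fact that $n = 50$ is finite.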


Remark 14.1 (Some history). Originally, the central limit theorem was
proved in 1733 by De Moivre for independent $Ber(\frac{1}{2})$ distributed random
variables. Laplace extended De Moivre's result to $Ber(p)$ random variables
and later formulated the central limit theorem as stated above. Around
1901 a first rigorous proof of this result was given by Lyapunov. Several
versions of the central limit theorem exist with weaker conditions than those
presented here. For example, for applications it is interesting that it is not
necessary that all $X_i$ have the same distribution; see Ross [26], Section 8.3,
or Feller [8], Section 8.4, and Billingsley [3], Section 27.


14.2  Applications  of  the  central  limit  theorem 

The  central  limit  theorem  provides  a  tool  to  approximate  the  probability 
distribution  of  the  average  or  the  sum  of  independent  identically  distributed 
random  variables.  This  plays  an  important  role  in  applications,  for  instance, 
see  Sections  23.4,  24.1,  26.2,  and  27.2.  Here  we  will  illustrate  the  use  of  the 
central  limit  theorem  to  approximate  probabilities  of  averages  and  sums  of 
random  variables  in  three  examples.  The  first  example  deals  with  an  average; 
the  other  two  concern  sums  of  random  variables. 

Did  we  have  bad  luck? 

In the example in Section 13.3 averages of independent $Gam(2,1)$ distributed
random variables were simulated for $n = 1, \ldots, 500$. In Figure 13.2 the
realization of $\bar{X}_n$ for $n = 400$ is 1.99, which is almost exactly equal
to the expected value 2. For $n = 500$ the simulation was 2.06, a little bit
farther away. Did we have bad luck, or is a value 2.06 or higher not unusual? To
answer this question we want to compute $P(\bar{X}_n \ge 2.06)$. We will find an
approximation of this probability using the central limit theorem.

Note that

$$\begin{aligned}
P(\bar{X}_n \ge 2.06) &= P(\bar{X}_n - \mu \ge 2.06 - \mu) \\
&= P\left( \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \ge \sqrt{n}\,\frac{2.06 - \mu}{\sigma} \right) \\
&= P\left( Z_n \ge \sqrt{n}\,\frac{2.06 - \mu}{\sigma} \right).
\end{aligned}$$

Since the $X_i$ are $Gam(2,1)$ random variables, $\mu = E[X_i] = 2$ and
$\sigma^2 = \mathrm{Var}(X_i) = 2$. We find for $n = 500$ that

$$P(\bar{X}_{500} \ge 2.06) = P\left( Z_{500} \ge \sqrt{500}\,\frac{2.06 - 2}{\sqrt{2}} \right) = P(Z_{500} \ge 0.95) = 1 - P(Z_{500} < 0.95).$$

It now follows from the central limit theorem that

$$P(\bar{X}_{500} \ge 2.06) \approx 1 - \Phi(0.95) = 0.1711.$$

This is close to the exact answer 0.1710881, which was obtained using the
probability density of $\bar{X}_n$ as given in Section 13.1.

Thus we see that there is about a 17% probability that the average
$\bar{X}_{500}$ is at least 0.06 above 2. Since 17% is quite large, we conclude
that the value 2.06 is not unusual. In other words, we did not have bad luck;
$n = 500$ is simply not large enough to be that close. Would 2.06 be unusual if
$n = 5000$?

Quick exercise 14.1 Show that $P(\bar{X}_{5000} \ge 2.06) \approx 0.0013$, using
the central limit theorem.
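The computation for an average follows one pattern: standardize the threshold, then look up the standard normal tail. A minimal sketch of that pattern is given below; the function name is an illustrative assumption, and $\Phi$ is taken from the standard library's `statistics.NormalDist`. Because the code does not round the standardized threshold to 0.95 first, its result for $n = 500$ differs slightly from the table-based value 0.1711.

```python
from math import sqrt
from statistics import NormalDist


def clt_tail_of_average(a, mu, sigma, n):
    """Normal approximation of P(average of n i.i.d. variables >= a)."""
    z = sqrt(n) * (a - mu) / sigma  # standardized threshold
    return 1.0 - NormalDist().cdf(z)


# Gam(2, 1) has mu = 2 and sigma = sqrt(2).
p500 = clt_tail_of_average(2.06, 2.0, sqrt(2.0), 500)
p5000 = clt_tail_of_average(2.06, 2.0, sqrt(2.0), 5000)
print(round(p500, 4), round(p5000, 4))
```

The same value 2.06 is unremarkable for $n = 500$ but far in the tail for $n = 5000$, since the standardized threshold grows with $\sqrt{n}$.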

Rounding  amounts  to  the  nearest  integer 

In Exercise 13.2 an accountant wanted to simplify his bookkeeping by rounding
amounts to the nearest integer, and you were asked to use Chebyshev's
inequality to compute an upper bound for the probability

$$p = P(|X_1 + X_2 + \cdots + X_{100}| > 10)$$

that the cumulative rounding error $X_1 + X_2 + \cdots + X_{100}$ exceeds €10.
This upper bound equals 1/12. In order to know the exact value of $p$ one has to
determine the distribution of the sum $X_1 + \cdots + X_{100}$. This is
difficult, but the central limit theorem is a handy tool to get an approximation
of $p$. Clearly,

$$p = P(X_1 + \cdots + X_{100} < -10) + P(X_1 + \cdots + X_{100} > 10).$$

Standardizing as in (14.1), for the second probability we write, with $n = 100$,

$$\begin{aligned}
P(X_1 + \cdots + X_n > 10) &= P(X_1 + \cdots + X_n - n\mu > 10 - n\mu) \\
&= P\left( \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} > \frac{10 - n\mu}{\sigma\sqrt{n}} \right) \\
&= P\left( Z_n > \frac{10 - n\mu}{\sigma\sqrt{n}} \right).
\end{aligned}$$

The $X_i$ are $U(-0.5, 0.5)$ random variables, $\mu = E[X_i] = 0$, and
$\sigma^2 = \mathrm{Var}(X_i) = 1/12$, so that

$$P(X_1 + \cdots + X_{100} > 10) = P\left( Z_{100} > \frac{10 - 100 \cdot 0}{\sqrt{1/12}\,\sqrt{100}} \right) = P(Z_{100} > 3.46).$$

It follows from the central limit theorem that

$$P(Z_{100} > 3.46) \approx 1 - \Phi(3.46) = 0.0003.$$

Similarly,

$$P(X_1 + \cdots + X_{100} < -10) \approx \Phi(-3.46) = 0.0003.$$

Thus we find that $p \approx 0.0006$.
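The two bounds for the rounding example can be put side by side numerically. The sketch below assumes nothing beyond the quantities stated in the text ($n = 100$, $\mu = 0$, $\sigma^2 = 1/12$, threshold 10); the function and variable names are illustrative. It evaluates the standardized sum (14.1) without rounding the threshold to 3.46, so the last digit may differ from the table-based value.

```python
from math import sqrt
from statistics import NormalDist


def clt_sum_upper_tail(c, n, mu, sigma):
    """Normal approximation of P(X_1 + ... + X_n > c), standardizing as in (14.1)."""
    z = (c - n * mu) / (sigma * sqrt(n))
    return 1.0 - NormalDist().cdf(z)


n, mu, sigma = 100, 0.0, sqrt(1.0 / 12.0)
upper = clt_sum_upper_tail(10.0, n, mu, sigma)
p_approx = 2.0 * upper  # lower tail equals upper tail: U(-0.5, 0.5) is symmetric around 0
chebyshev_bound = n * sigma ** 2 / 10.0 ** 2  # Var(sum) / a^2 with a = 10
print(round(p_approx, 4), round(chebyshev_bound, 4))
```

The Chebyshev bound 1/12 is valid for every distribution with this variance, which is why it is so much larger than the normal approximation for this particular sum.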

Normal  approximation  of  the  binomial  distribution 

In Section 4.3 we considered the (fictitious) situation that you attend,
completely unprepared, a multiple-choice exam consisting of 10 questions. We saw
that the probability you will pass equals

$$P(X \ge 6) = 0.0197,$$

where $X$, being the sum of 10 independent $Ber(\frac{1}{4})$ random variables, has
a $Bin(10, \frac{1}{4})$ distribution. As we saw in Chapter 4 it is rather easy, but
tedious, to calculate $P(X \ge 6)$. Although $n$ is small, we investigate what the
central limit theorem will yield as an approximation of $P(X \ge 6)$. Recall that
a random variable with a $Bin(n,p)$ distribution can be written as the sum of
$n$ independent $Ber(p)$ distributed random variables $R_1, \ldots, R_n$.
Substituting $n = 10$, $\mu = p = 1/4$, and $\sigma^2 = p(1 - p) = 3/16$, it
follows from the central limit theorem that

$$\begin{aligned}
P(X \ge 6) &= P(R_1 + \cdots + R_n \ge 6) \\
&= P\left( \frac{R_1 + \cdots + R_n - n\mu}{\sigma\sqrt{n}} \ge \frac{6 - n\mu}{\sigma\sqrt{n}} \right) \\
&= P\left( Z_{10} \ge \frac{6 - 2\frac{1}{2}}{\sqrt{3/16}\,\sqrt{10}} \right) \\
&\approx 1 - \Phi(2.56) = 0.0052.
\end{aligned}$$

The number 0.0052 is quite a poor approximation for the true value 0.0197.
Note however, that we could also argue that

$$\begin{aligned}
P(X \ge 6) &= P(X > 5) \\
&= P(R_1 + \cdots + R_n > 5) \\
&= P\left( Z_{10} > \frac{5 - 2\frac{1}{2}}{\sqrt{3/16}\,\sqrt{10}} \right) \\
&\approx 1 - \Phi(1.83) = 0.0336,
\end{aligned}$$

which gives an approximation that is too large! A better approach lies
somewhere in the middle, as the following quick exercise illustrates.

Quick exercise 14.2 Apply the central limit theorem to find 0.0143 as an
approximation to $P(X \ge 5\frac{1}{2})$. Since $P(X \ge 6) = P(X \ge 5\frac{1}{2})$,
this also provides an approximation of $P(X \ge 6)$.
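The three thresholds 6, 5, and $5\frac{1}{2}$ can be compared against the exact binomial probability in a few lines. This is a sketch with illustrative function names; the exact tail is summed directly from the binomial probabilities, and the normal tails are computed without first rounding the standardized threshold, so they may differ in the last digit from the table-based values in the text.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_upper_tail(n, p, k):
    """Exact P(X >= k) for X with a Bin(n, p) distribution."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


def normal_upper_tail(n, p, c):
    """Normal approximation of P(X >= c), using mean n*p and variance n*p*(1-p)."""
    z = (c - n * p) / sqrt(n * p * (1 - p))
    return 1.0 - NormalDist().cdf(z)


n, p = 10, 0.25
exact = binom_upper_tail(n, p, 6)
approx_at_6 = normal_upper_tail(n, p, 6.0)    # too small
approx_at_5 = normal_upper_tail(n, p, 5.0)    # too large
approx_at_5_5 = normal_upper_tail(n, p, 5.5)  # threshold halfway between 5 and 6
print(round(exact, 4), round(approx_at_6, 4), round(approx_at_5, 4), round(approx_at_5_5, 4))
```

Placing the threshold halfway between two attainable values of $X$ is commonly called a continuity correction; that name is not used in the text above.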

How  large  should  n  be? 

In  view  of  the  previous  examples  one  might  raise  the  question  of  how  large  n 
should  be  to  have  a  good  approximation  when  using  the  central  limit  theorem. 
In  other  words,  how  fast  is  the  convergence  to  the  normal  distribution?  This 
is  a  difficult  question  to  answer  in  general.  For  instance,  in  the  third  example 
one  might  initially  be  tempted  to  think  that  the  approximation  was  quite 
poor,  but  after  taking  the  fact  into  account  that  we  approximate  a  discrete 
distribution  by  a  continuous  one  we  obtain  a  considerable  improvement  of  the 
approximation,  as  was  illustrated  in  Quick  exercise  14.2.  For  another  example, 
see Figure 14.2. Here we see that the convergence is slightly faster for the
bimodal distribution than for the $Gam(2,1)$ distribution, which is due to the
fact that the $Gam(2,1)$ is rather asymmetric.

In general the approximation might be poor when $n$ is small, when the
distribution of the $X_i$ is asymmetric, bimodal, or discrete, or when the value
$a$ in $P(\bar{X}_n > a)$ is far from the center of the distribution of the $X_i$.


14.3  Solutions  to  the  quick  exercises 


14.1 In the same way we approximated $P(\bar{X}_n \ge 2.06)$ using the central
limit theorem, we have that

$$P(\bar{X}_n \ge 2.06) = P\left( Z_n \ge \sqrt{n}\,\frac{2.06 - \mu}{\sigma} \right).$$

With $\mu = 2$ and $\sigma = \sqrt{2}$, we find for $n = 5000$ that

$$P(\bar{X}_{5000} \ge 2.06) = P(Z_{5000} \ge 3),$$

which is approximately equal to $1 - \Phi(3) = 0.0013$, thanks to the central
limit theorem. Because we think that 0.13% is a small probability, to find 2.06
as a value for $\bar{X}_{5000}$ would mean that you really had bad luck!


14.2 Similar to the computation of $P(X \ge 6)$, we have

$$\begin{aligned}
P\left(X \ge 5\tfrac{1}{2}\right) &= P\left(R_1 + \cdots + R_{10} \ge 5\tfrac{1}{2}\right) \\
&= P\left( Z_{10} \ge \frac{5\frac{1}{2} - 2\frac{1}{2}}{\sqrt{3/16}\,\sqrt{10}} \right) \\
&\approx 1 - \Phi(2.19) = 0.0143.
\end{aligned}$$

We have seen that using the central limit theorem to approximate $P(X \ge 6)$
gives an underestimate of this probability, while using the central limit
theorem to approximate $P(X > 5)$ gives an overestimate. Since $5\frac{1}{2}$ is
"in the middle," the approximation will be better.


14.4  Exercises 

14.1 Let $X_1, X_2, \ldots, X_{144}$ be independent identically distributed
random variables, each with expected value $\mu = E[X_i] = 2$, and variance
$\sigma^2 = \mathrm{Var}(X_i) = 4$. Approximate
$P(X_1 + X_2 + \cdots + X_{144} > 144)$, using the central limit theorem.

14.2 □ Let $X_1, X_2, \ldots, X_{625}$ be independent identically distributed
random variables, with probability density function $f$ given by

4-/  \  for0<x<l, 

f(x)  =  < 

(0  otherwise. 

Use the central limit theorem to approximate $P(X_1 + X_2 + \cdots + X_{625} < 170)$.

14.3 ⊞ In Exercise 13.4 a you were asked to use Chebyshev's inequality to
determine how large $n$ should be (how many people should be interviewed) so
that the probability that $\bar{X}_n$ is within 0.2 of the "true" $p$ is at least
0.9. Here $p$ is the proportion of the voters in Florida who will vote for G (and
$1 - p$ is the proportion of the voters who will vote for B). How large should
$n$ at least be according to the central limit theorem?

14.4 □ In the single-server queue model from Section 6.4, $T_i$ is the time
between the arrival of the $(i - 1)$th and $i$th customers. Furthermore, one
of the model assumptions is that the $T_i$ are independent, $Exp(0.5)$
distributed random variables. In Section 11.2 we saw that the probability
$P(T_1 + \cdots + T_{30} \le 60)$ of the 30th customer arriving within an hour at
the well is equal to 0.542. Find the normal approximation of this probability.

14.5 ⊞ Let $X$ be a $Bin(n,p)$ distributed random variable. Show that the
random variable

$$\frac{X - np}{\sqrt{np(1 - p)}}$$

has a distribution that is approximately standard normal.

14.6 □ Again, as in the previous exercise, let $X$ be a $Bin(n,p)$ distributed
random variable.

a. An exact computation yields that $P(X \le 25) = 0.55347$, when $n = 100$
and $p = 1/4$. Use the central limit theorem to give an approximation of
$P(X \le 25)$ and $P(X < 26)$.

b. When $n = 100$ and $p = 1/4$, then $P(X \le 2) = 1.87 \cdot 10^{-10}$. Use the
central limit theorem to give an approximation of this probability.


14.7 Let $X_1, X_2, \ldots, X_n$ be $n$ independent random variables, each with
expected value $\mu$ and finite positive variance $\sigma^2$. Use Chebyshev's
inequality to show that for any $a > 0$ one has

$$P\left( \left| n^{1/4}(\bar{X}_n - \mu) \right| \ge a \right) \le \frac{\sigma^2}{a^2 \sqrt{n}}.$$

Use this fact to explain the occurrence of a single spike in the left column of
Figure 14.1.

14.8 Let $X_1, X_2, \ldots$ be a sequence of independent $N(0,1)$ distributed
random variables. For $n = 1, 2, \ldots$, let $Y_n$ be the random variable
defined by

$$Y_n = X_1^2 + \cdots + X_n^2.$$

a. Show that $E[X_i^2] = 1$.

b. One can show, using integration by parts, that $E[X_i^4] = 3$. Deduce from
this that $\mathrm{Var}(X_i^2) = 2$.

c. Use the central limit theorem to approximate $P(Y_{100} > 110)$.

14.9 ⊞ A factory produces links for heavy metal chains. The research lab of
the factory models the length (in cm) of a link by the random variable $X$,
with expected value $E[X] = 5$ and variance $\mathrm{Var}(X) = 0.04$. The length
of a link is defined in such a way that the length of a chain is equal to the sum
of the lengths of its links. The factory sells chains of 50 meters; to be on the
safe side 1002 links are used for such chains. The factory guarantees that the
chain is not shorter than 50 meters. If by chance a chain is too short, the
customer is reimbursed, and a new chain is given for free.

a.  Give  an  estimate  of  the  probability  that  for  a  chain  of  at  least  50  meters 
more  than  1002  links  are  needed.  For  what  percentage  of  the  chains  does 
the  factory  have  to  reimburse  clients  and  provide  free  chains? 

b.  The  sales  department  of  the  factory  notices  that  it  has  to  hand  out  a 
lot  of  free  chains  and  asks  the  research  lab  what  is  wrong.  After  further 
investigations  the  research  lab  reports  to  the  sales  department  that  the 
expectation  value  5  is  incorrect,  and  that  the  correct  value  is  4.99  (cm). 
Do  you  think  that  it  was  necessary  to  report  such  a  minor  change  of  this 
value? 

14.10  Chebyshev’s  inequality  was  used  in  Exercise  13.5  to  determine  how 
many  times  n  one  needs  to  measure  a  sample  to  be  90%  sure  that  the  average 
of  the  measurements  is  within  half  a  degree  of  the  actual  melting  point  c  of  a 
new  material. 

a.  Use  the  normal  approximation  to  find  a  less  conservative  value  for  n. 

b. Only in case the random errors $U_i$ in the measurements have a normal
distribution is the value of $n$ from a "exact"; in all other cases it is an
approximation. Explain this.


15 


Exploratory  data  analysis:  graphical  summaries 


In  the  previous  chapters  we  focused  on  probability  models  to  describe  random 
phenomena.  Confronted  with  a  new  phenomenon,  we  want  to  learn  about  the 
randomness  that  is  associated  with  it.  It  is  common  to  conduct  an  experiment 
for  this  purpose  and  record  observations  concerning  the  phenomenon.  The  set 
of  observations  is  called  a  dataset.  By  exploring  the  dataset  we  can  gain  insight 
into  what  probability  model  suits  the  phenomenon. 

Frequently you will have to deal with a dataset that contains so many elements
that it is necessary to condense the data for easy visual comprehension
of general characteristics. In this chapter we present several graphical methods
to do so. To graphically represent univariate datasets, consisting of repeated
measurements of one particular quantity, we discuss the classical histogram,
the more recently introduced kernel density estimates, and the empirical
distribution function. To represent a bivariate dataset, which consists of
repeated measurements of two quantities, we use the scatterplot.


15.1  Example:  the  Old  Faithful  data 

The  Old  Faithful  geyser  at  Yellowstone  National  Park,  Wyoming,  USA,  was 
observed  from  August  1st  to  August  15th,  1985.  During  that  time,  data  were 
collected  on  the  duration  of  eruptions.  There  were  272  eruptions  observed,  of 
which  the  recorded  durations  are  listed  in  Table  15.1.  The  data  are  given  in 
seconds. 

Table 15.1. Duration in seconds of 272 eruptions of the Old Faithful geyser.

216 108 200 137 272 173 282 216 117 261 110 235 252 105 282 130
105 288  96 255 108 105 207 184 272 216 118 245 231 266 258 268
202 242 230 121 112 290 110 287 261 113 274 105 272 199 230 126
278 120 288 283 110 290 104 293 223 100 274 259 134 270 105 288
109 264 250 282 124 282 242 118 270 240 119 304 121 274 233 216
248 260 246 158 244 296 237 271 130 240 132 260 112 289 110 258
280 225 112 294 149 262 126 270 243 112 282 107 291 221 284 138
294 265 102 278 139 276 109 265 157 244 255 118 276 226 115 270
136 279 112 250 168 260 110 263 113 296 122 224 254 134 272 289
260 119 278 121 306 108 302 240 144 276 214 240 270 245 108 238
132 249 120 230 210 275 142 300 116 277 115 125 275 200 250 260
270 145 240 250 113 275 255 226 122 266 245 110 265 131 288 110
288 246 238 254 210 262 135 280 126 261 248 112 276 107 262 231
116 270 143 282 112 230 205 254 144 288 120 249 112 256 105 269
240 247 245 256 235 273 245 145 251 133 267 113 111 257 237 140
249 141 296 174 275 230 125 262 128 261 132 267 214 270 249 229
235 267 120 257 286 272 111 255 119 135 285 247 129 265 109 268

Source: W. Härdle. Smoothing techniques with implementation in S. 1991;
Table 3, page 201. © Springer New York.

The variety in the lengths of the eruptions indicates that randomness is
involved. By exploring the dataset we might learn about this randomness. For
instance: we like to know which durations are more likely to occur than others;
is there something like "the typical duration of an eruption"; do the durations
vary symmetrically around the center of the dataset; and so on. In order to
retrieve this type of information, just listing the observed durations does not
help us very much. Somehow we must summarize the observed data. We could
start by computing the mean of the data, which is 209.3 for the Old Faithful
data. However, this is a poor summary of the dataset, because there is a lot
more information in the observed durations. How do we get hold of this?
Just staring at the dataset for a while tells us very little. To see something,
we have to rearrange the data somehow. The first thing we could do is order
the data. The result is shown in Table 15.2. Putting the elements in order
already provides more information. For instance, it is now immediately clear
that all elements lie between 96 and 306.

Quick  exercise  15.1  Which  two  elements  of  the  Old  Faithful  dataset  split 
the  dataset  in  three  groups  of  equal  size? 

Table 15.2. Ordered durations of eruptions of the Old Faithful geyser.

 96 100 102 104 105 105 105 105 105 105 107 107 108 108 108 108
109 109 109 110 110 110 110 110 110 110 111 111 112 112 112 112
112 112 112 112 113 113 113 113 115 115 116 116 117 118 118 118
119 119 119 120 120 120 120 121 121 121 122 122 124 125 125 126
126 126 128 129 130 130 131 132 132 132 133 134 134 135 135 136
137 138 139 140 141 142 143 144 144 145 145 149 157 158 168 173
174 184 199 200 200 202 205 207 210 210 214 214 216 216 216 216
221 223 224 225 226 226 229 230 230 230 230 230 231 231 233 235
235 235 237 237 238 238 240 240 240 240 240 240 242 242 243 244
244 245 245 245 245 245 246 246 247 247 248 248 249 249 249 249
250 250 250 250 251 252 254 254 254 255 255 255 255 256 256 257
257 258 258 259 260 260 260 260 260 261 261 261 261 262 262 262
262 263 264 265 265 265 265 266 266 267 267 267 268 268 269 270
270 270 270 270 270 270 270 271 272 272 272 272 272 273 274 274
274 275 275 275 275 276 276 276 276 277 278 278 278 279 280 280
282 282 282 282 282 282 283 284 285 286 287 288 288 288 288 288
288 289 289 290 290 291 293 294 294 296 296 296 300 302 304 306

A closer look at the ordered data shows that the two middle elements (the
136th and 137th elements in ascending order) are equal to 240, which is much
closer to the maximum value 306 than to the minimum value 96. This seems to
indicate that the dataset is somewhat asymmetric, but even from the ordered
dataset we cannot get a clear picture of this asymmetry. Also, geologists
believe that there are two different kinds of eruptions that play a role. Hence
one would expect two separate values around which the elements of the dataset
would accumulate, corresponding to the typical durations of the two types of
eruptions. Again it is not clear, not even from the ordered dataset, what these
two typical values are. It would be better to have a plot of the dataset that
reflects symmetry or asymmetry of the data and from which we can easily see
where the elements accumulate. In the following sections we will discuss two
such methods.


15.2  Histograms 

The classical method to graphically represent data is the histogram, which
probably dates from the mortality studies of John Graunt in 1662 (see
Westergaard [39], p. 22). The term histogram appears to have been used first by
Karl Pearson ([22]). Figure 15.1 displays a histogram of the Old Faithful data.
The picture immediately reveals the asymmetry of the dataset and the fact
that the elements accumulate somewhere near 120 and 270, which was not
clear from Tables 15.1 and 15.2.


Fig. 15.1. Histogram of the Old Faithful data.


The construction of the histogram is as follows. Let us denote a generic
(univariate) dataset of size $n$ by

$$x_1, x_2, \ldots, x_n$$

and suppose we want to construct a histogram. We use the version of the
histogram that is scaled in such a way that the total area under the curve is
equal to one.¹

First we divide the range of the data into intervals. These intervals are called
bins and are denoted by

$$B_1, B_2, \ldots, B_m.$$

The length of an interval $B_i$ is denoted by $|B_i|$ and is called the bin width.
The bins do not necessarily have the same width. In Figure 15.1 we have eight
bins of equal bin width. We want the area under the histogram on each bin
$B_i$ to reflect the number of elements in $B_i$. Since the total area 1 under the
histogram then corresponds to the total number of elements $n$ in the dataset,
the area under the histogram on a bin $B_i$ is equal to the proportion of elements
in $B_i$:

$$\frac{\text{the number of } x_j \text{ in } B_i}{n}.$$

The height of the histogram on bin $B_i$ must then be equal to

$$\frac{\text{the number of } x_j \text{ in } B_i}{n\,|B_i|}.$$

¹ The reason to scale the histogram so that the total area under the curve is
equal to one is that if we view the data as being generated from some unknown
probability density $f$ (see Chapter 17), such a histogram can be used as a crude
estimate of $f$.

Quick exercise 15.2 Use Table 15.2 to count how many elements fall into
each of the bins (90, 120], (120, 150], . . . , (300, 330] in Figure 15.1 and
compute the height on each bin.
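The counting and scaling steps of the construction can be sketched in code. The function below is an illustrative assumption, not part of the text: it takes equal-width bins of the form $(r + (i-1)b,\, r + ib]$, open on the left and closed on the right like the bins in Quick exercise 15.2, and returns the counts together with the heights. To avoid giving away the quick exercise, it is applied only to the first sixteen entries of Table 15.1 rather than to the full dataset.

```python
from math import ceil


def histogram_heights(data, r, b, m):
    """Counts and heights for m equal-width bins (r + i*b, r + (i+1)*b], i = 0..m-1."""
    counts = [0] * m
    for x in data:
        i = ceil((x - r) / b) - 1  # zero-based bin index; right endpoints belong to the bin
        if not 0 <= i < m:
            raise ValueError(f"value {x} falls outside the bins")
        counts[i] += 1
    n = len(data)
    heights = [c / (n * b) for c in counts]  # height = (number in bin) / (n * bin width)
    return counts, heights


# First sixteen entries of Table 15.1 (an excerpt, not the full dataset).
sample = [216, 108, 200, 137, 272, 173, 282, 216,
          117, 261, 110, 235, 252, 105, 282, 130]
counts, heights = histogram_heights(sample, r=90, b=30, m=8)
total_area = sum(h * 30 for h in heights)
print(counts, round(total_area, 6))
```

Because each height is the bin's proportion divided by the bin width, the areas of the bars add up to one regardless of the data or the bin width chosen.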

Choice  of  the  bin  width 

Consider a histogram with bins of equal width. In that case the bins are of
the form

$$B_i = (r + (i - 1)b,\; r + ib] \quad \text{for } i = 1, 2, \ldots, m,$$

where $r$ is some reference point smaller than the minimum of the dataset,
and $b$ denotes the bin width. In Figure 15.2, three histograms of the Old
Faithful data of Table 15.2 are displayed with bin widths equal to 2, 30, and
90, respectively. Clearly, the choice of the bin width $b$, or the corresponding
choice of the number of bins $m$, will determine what the resulting histogram
will look like. Choosing the bin width too small will result in a chaotic figure
with many isolated peaks. Choosing the bin width too large will result in a
figure without much detail, at the risk of losing information about general
characteristics. In Figure 15.2, bin width $b = 2$ is somewhat too small. Bin
width $b = 90$ is clearly too large and produces a histogram that no longer
captures the fact that the data show two separate modes near 120 and 270.

How does one go about choosing the bin width? In practice, this might boil
down to picking the bin width by trial and error, continuing until the figure
looks reasonable. Mathematical research, however, has provided some guidelines
for a data-based choice for b or m. Formulas that may effectively be
used are $m = 1 + 3.3 \log_{10}(n)$ (see [34]) or $b = 3.49\, s\, n^{-1/3}$ (see [29]; see also
Remark 15.1), where s is the sample standard deviation (see Section 16.2 for
the definition of the sample standard deviation).

Fig. 15.2. Histograms of the Old Faithful data with different bin widths.

Remark 15.1 (Normal reference method for histograms). Let
$H_n(x)$ denote the height of the histogram at x and suppose that we view our
dataset as being generated from a probability distribution with density f.
We would like to find the bin width that minimizes the difference between
$H_n$ and f, measured by the so-called mean integrated squared error (MISE)

$$\mathrm{E}\left[ \int_{-\infty}^{\infty} \bigl( H_n(x) - f(x) \bigr)^2 \, dx \right].$$

Under suitable smoothness conditions on f, the value of b that minimizes
the MISE as n goes to infinity is given by

$$b = C(f)\, n^{-1/3}, \quad \text{where } C(f) = 6^{1/3} \left( \int_{-\infty}^{\infty} f'(x)^2 \, dx \right)^{-1/3}$$

(see for instance [29] or [12]). A simple data-based choice for b is obtained by
estimating the constant C(f). The normal reference method takes f to be
the density of an $N(\mu, \sigma^2)$ distribution, in which case $C(f) = (24\sqrt{\pi})^{1/3} \sigma$.
Estimating $\sigma$ by the sample standard deviation s (see Chapter 16 for a
definition of s) would result in bin width

$$b = (24\sqrt{\pi})^{1/3}\, s\, n^{-1/3}.$$

For the Old Faithful data this would give b = 36.89.
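
As a numerical check of the two rules of thumb, the following Python sketch evaluates them for the Old Faithful values used in this chapter (n = 272 and s = 68.48). The function names are illustrative assumptions; the constant 3.49 is the rounded value of $(24\sqrt{\pi})^{1/3}$.

```python
import math

def rule_of_thumb_bin_width(s, n, c=3.49):
    """Bin width b = c * s * n**(-1/3); c = 3.49 is the rounded constant from the text."""
    return c * s * n ** (-1 / 3)

def rule_of_thumb_bin_count(n):
    """Number of bins m = 1 + 3.3 * log10(n), before rounding to a whole number."""
    return 1 + 3.3 * math.log10(n)

# Old Faithful values used in this chapter: n = 272 observations, s = 68.48.
n, s = 272, 68.48
exact_constant = (24 * math.sqrt(math.pi)) ** (1 / 3)   # normal reference constant
b = rule_of_thumb_bin_width(s, n)
m = rule_of_thumb_bin_count(n)
print(f"constant = {exact_constant:.4f}, b = {b:.2f}, m = {m:.2f}")
```

The two rules need not agree with each other: one fixes the bin width from the spread of the data, the other fixes the number of bins from the sample size alone.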


Quick exercise 15.3 If we construct a histogram for the Old Faithful data
with equal bin width $b = 3.49\, s\, n^{-1/3}$, how many bins will we need to cover the
data if s = 68.48?


The  main  advantage  of  the  histogram  is  that  it  is  simple.  Its  disadvantage  is 
the discrete character of the plot. In Figure 15.1 it is still somewhat unclear
which  two  values  correspond  to  the  typical  durations  of  the  two  types  of 
eruptions.  Another  well-known  artifact  is  that  changing  the  bin  width  slightly 
or  keeping  the  bin  width  fixed  and  shifting  the  bins  slightly  may  result  in  a 
figure  of  a  different  nature.  A  method  that  produces  a  smoother  figure  and  is 
less  sensitive  to  these  kinds  of  changes  will  be  discussed  in  the  next  section. 


15.3  Kernel  density  estimates 

We can graphically represent data in a more variegated plot by a so-called
kernel density estimate. The basic ideas of kernel density estimation first
appeared in the early 1950s. Rosenblatt [25] and Parzen [21] provided the
stimulus for further research on this topic. Although the method was introduced
in the middle of the last century, until recently it remained unpopular as a
tool for practitioners because of its computationally intensive nature.

Figure 15.3 displays a kernel density estimate of the Old Faithful data. Again
the picture immediately reveals the asymmetry of the dataset, but it is much
smoother than the histogram in Figure 15.1. Note that it is now easier to
detect the two typical values around which the elements accumulate.

Fig. 15.3. Kernel density estimate of the Old Faithful data.

The idea behind the construction of the plot is to “put a pile of sand” around
each element of the dataset. At places where the elements accumulate, the
sand will pile up. The actual plot is constructed by choosing a kernel K and
a bandwidth h. The kernel K reflects the shape of the piles of sand, whereas
the bandwidth is a tuning parameter that determines how wide the piles
of sand will be. Formally, a kernel K is a function $K : \mathbb{R} \to \mathbb{R}$. Figure 15.4
displays several well-known kernels. A kernel K typically satisfies the following
conditions:

(K1) K is a probability density, i.e., $K(u) \ge 0$ and $\int_{-\infty}^{\infty} K(u)\, du = 1$;

(K2) K is symmetric around zero, i.e., $K(u) = K(-u)$;

(K3) $K(u) = 0$ for $|u| > 1$.

Examples are the Epanechnikov kernel

$$K(u) = \frac{3}{4} \left( 1 - u^2 \right) \quad \text{for } -1 \le u \le 1$$

and K(u) = 0 elsewhere, and the triweight kernel

$$K(u) = \frac{35}{32} \left( 1 - u^2 \right)^3 \quad \text{for } -1 \le u \le 1$$

and K(u) = 0 elsewhere. Sometimes one uses kernels that do not satisfy
condition (K3), for example, the normal kernel

$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} u^2} \quad \text{for } -\infty < u < \infty.$$

Let us denote a kernel density estimate by $f_{n,h}$, and suppose that we want to
construct $f_{n,h}$ for a dataset $x_1, x_2, \ldots, x_n$. In Figure 15.5 the construction is
illustrated for a dataset containing five elements, where we use the Epanechnikov
kernel and bandwidth h = 0.5. First we scale the kernel K (solid line)
into the function

$$t \mapsto \frac{1}{h} K\left( \frac{t}{h} \right).$$

The scaled kernel (dotted line) is of the same type as the original kernel, with
area 1 under the curve but is positive on the interval [-h, h] instead of [-1, 1]
and higher (lower) when h is smaller (larger) than 1. Next, we put a scaled
kernel around each element $x_i$ in the dataset. This results in functions of the
type

$$t \mapsto \frac{1}{h} K\left( \frac{t - x_i}{h} \right).$$

These shifted kernels (dotted lines) have the same shape as the transformed
kernel, all with area 1 under the curve, but they are now symmetric around
$x_i$ and positive on the interval $[x_i - h, x_i + h]$. We see that the graphs of the
shifted kernels will overlap whenever $x_i$ and $x_j$ are close to each other, so
that things will pile up more at places where more elements accumulate. The
kernel density estimate $f_{n,h}$ is constructed by summing the scaled kernels and
dividing them by n, in order to obtain area 1 under the curve:

$$f_{n,h}(t) = \frac{1}{n} \left\{ \frac{1}{h} K\left( \frac{t - x_1}{h} \right) + \frac{1}{h} K\left( \frac{t - x_2}{h} \right) + \cdots + \frac{1}{h} K\left( \frac{t - x_n}{h} \right) \right\}$$

or briefly,

$$f_{n,h}(t) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{t - x_i}{h} \right). \qquad (15.1)$$

Fig. 15.4. Examples of well-known kernels K: triangular, cosine, Epanechnikov,
biweight, triweight, and normal kernel.

Fig. 15.5. Construction of a kernel density estimate $f_{n,h}$. Panels: kernel and
scaled kernel; shifted kernel; kernel density estimate.


When computing $f_{n,h}(t)$, we assign higher weights to observations $x_i$ closer to
t, in contrast to the histogram where we simply count the number of observations
in the bin that contains t. Note that as a consequence of condition (K1),
$f_{n,h}$ itself is a probability density:

$$f_{n,h}(t) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f_{n,h}(t)\, dt = 1.$$

Quick exercise 15.4 Check that the total area under the kernel density
estimate is equal to one, i.e., show that $\int_{-\infty}^{\infty} f_{n,h}(t)\, dt = 1$.

Note that computing $f_{n,h}$ is very computationally intensive. Its common use
nowadays is therefore a typical product of the recent developments in computer
hardware, despite the fact that the method was introduced much earlier.
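
The construction can be sketched in a few lines of Python. The code below evaluates a kernel density estimate as the average of scaled and shifted Epanechnikov kernels and checks the area under the curve with a simple midpoint sum. The five-element dataset is hypothetical (the values behind Figure 15.5 are not given here), and the function names are illustrative assumptions.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 3/4 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - u * u) if -1 <= u <= 1 else 0.0

def kernel_density_estimate(data, h, kernel=epanechnikov):
    """Return the function t -> (1/(n*h)) * sum_i K((t - x_i)/h)."""
    n = len(data)

    def f(t):
        return sum(kernel((t - x) / h) for x in data) / (n * h)

    return f

# Hypothetical five-element dataset, with bandwidth h = 0.5.
data = [-1.2, -0.4, 0.1, 0.3, 1.1]
h = 0.5
f = kernel_density_estimate(data, h)

# Midpoint sum over [min - h, max + h]; outside this interval f is zero.
lo, hi, steps = min(data) - h, max(data) + h, 20000
dt = (hi - lo) / steps
area = sum(f(lo + (k + 0.5) * dt) for k in range(steps)) * dt
print(f"f(0) = {f(0.0):.4f}, area = {area:.6f}")
```

Every evaluation of the estimate loops over all n observations, so plotting it on a fine grid costs on the order of n times the number of grid points. This is the computational burden referred to above.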

Choice  of  the  bandwidth 

The  bandwidth  h  plays  the  same  role  for  kernel  density  estimates  as  the  bin 
width  b  does  for  histograms.  In  Figure  15.6  three  kernel  density  estimates  of 
the  Old  Faithful  data  are  plotted  with  the  triweight  kernel  and  bandwidths 
1.8,  18,  and  180.  It  is  clear  that  the  choice  of  the  bandwidth  h  determines 
largely  what  the  resulting  kernel  density  estimate  will  look  like.  Choosing  the 
bandwidth  too  small  will  produce  a  curve  with  many  isolated  peaks.  Choosing 
the  bandwidth  too  large  will  produce  a  very  smooth  curve,  at  the  risk  of 
smoothing away important features of the data. In Figure 15.6 bandwidth
h = 1.8 is somewhat too small. Bandwidth h = 180 is clearly too large and
produces an oversmoothed kernel density estimate that no longer captures the
fact that the data show two separate modes.

Fig. 15.6. Kernel estimates of the Old Faithful data.

How does one go about choosing the bandwidth? Similar to histograms, in
practice one could do this by trial and error and continue until one obtains
a reasonable picture. Recent research, however, has provided some guidelines
for a data-based choice of h. A formula that may effectively be used is
$h = 1.06\, s\, n^{-1/5}$, where s denotes the sample standard deviation (see, for instance,
[31]; see also Remark 15.2).


Remark 15.2 (Normal reference method for kernel estimates).
Suppose we view our dataset as being generated from a probability distribution
with density f. Let K be a fixed chosen kernel and let $f_{n,h}$ be
the kernel density estimate. We would like to take the bandwidth that minimizes
the difference between $f_{n,h}$ and f, measured by the so-called mean
integrated squared error (MISE)

$$\mathrm{E}\left[ \int_{-\infty}^{\infty} \bigl( f_{n,h}(x) - f(x) \bigr)^2 \, dx \right].$$

Under suitable smoothness conditions on f, the value of h that minimizes
the MISE, as n goes to infinity, is given by

$$h = C_1(f)\, C_2(K)\, n^{-1/5},$$

where the constants $C_1(f)$ and $C_2(K)$ are given by

$$C_1(f) = \left( \frac{1}{\int_{-\infty}^{\infty} f''(x)^2 \, dx} \right)^{1/5} \quad \text{and} \quad C_2(K) = \frac{\left( \int_{-\infty}^{\infty} K(u)^2 \, du \right)^{1/5}}{\left( \int_{-\infty}^{\infty} u^2 K(u)\, du \right)^{2/5}}.$$

After choosing the kernel K, one can compute the constant $C_2(K)$ to obtain
a simple data-based choice for h by estimating the constant $C_1(f)$. For
instance, for the normal kernel one finds $C_2(K) = (2\sqrt{\pi})^{-1/5}$. As with
histograms (see Remark 15.1), the normal reference method takes f to be
the density of an $N(\mu, \sigma^2)$ distribution, in which case $C_1(f) = (8\sqrt{\pi}/3)^{1/5} \sigma$.
Estimating $\sigma$ by the sample standard deviation s (see Chapter 16 for a
definition of s) would result in bandwidth

$$h = \left( \frac{4}{3} \right)^{1/5} s\, n^{-1/5}.$$

For the Old Faithful data, this would give h = 23.64.
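
The arithmetic of the normal reference method can be checked with a short Python sketch. It multiplies the two constants for the normal kernel and the normal reference density, and compares the result with the rounded rule h = 1.06 s n^(-1/5). The function name is an illustrative assumption; n = 272 and s = 68.48 are the Old Faithful values used in this chapter.

```python
import math

def normal_reference_bandwidth(s, n):
    """h = C1(f) * C2(K) * n**(-1/5) for the normal kernel and a normal reference density."""
    c2 = (2 * math.sqrt(math.pi)) ** (-1 / 5)           # C2(K) for the normal kernel
    c1 = (8 * math.sqrt(math.pi) / 3) ** (1 / 5) * s    # C1(f), with sigma estimated by s
    return c1 * c2 * n ** (-1 / 5)

n, s = 272, 68.48
h_exact = normal_reference_bandwidth(s, n)
h_rounded = 1.06 * s * n ** (-1 / 5)     # rule with the constant rounded to 1.06
factor = (4 / 3) ** (1 / 5)              # the product C1 * C2 divided by s
print(f"factor = {factor:.4f}, h_exact = {h_exact:.2f}, h_rounded = {h_rounded:.2f}")
```

The small difference between the two bandwidths comes only from rounding the constant $(4/3)^{1/5}$ to 1.06.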

Quick exercise 15.5 If we construct a kernel density estimate for the Old
Faithful data with bandwidth $h = 1.06\, s\, n^{-1/5}$, then on what interval is $f_{n,h}$
strictly positive if s = 68.48?

Choice  of  the  kernel 

To  construct  a  kernel  density  estimate,  one  has  to  choose  a  kernel  K  and  a 
bandwidth  h.  The  choice  of  kernel  is  less  important.  In  Figure  15.7  we  have 
plotted  two  kernel  density  estimates  for  the  Old  Faithful  data  of  Table  15.1: 
one  is  constructed  with  the  triweight  kernel  (solid  line),  and  one  with  the 
Epanechnikov  kernel  (dotted  line),  both  with  the  same  bandwidth  h  =  24.  As 
one  can  see,  the  graphs  are  very  similar.  If  one  wants  to  compare  with  the 
normal  kernel,  one  should  set  the  bandwidth  of  the  normal  kernel  at  about 
h/4. This has to do with the fact that the normal kernel is much more spread
out  than  the  two  kernels  mentioned  here,  which  are  zero  outside  [—1, 1]. 


Fig.  15.7.  Kernel  estimates  of  the  Old  Faithful  data  with  different  kernels:  triweight 
(solid  line)  and  Epanechnikov  kernel  (dotted),  both  with  bandwidth  h  =  24. 


Boundary  kernels 

In order to estimate the parameters of a software reliability model, failure data
are collected. Usually the most desirable type of failure data results when the
failure times are recorded, or equivalently, the length of an interval between
successive failures. The data in Table 15.3 are observed interfailure times in
CPU seconds for a certain control software system. On the left in Figure 15.8
a kernel density estimate of the observed interfailure times is plotted. Note
that to the left of the origin, $f_{n,h}$ is positive. This is absurd, since it suggests
that there are negative interfailure times.

Table 15.3. Interfailure times between successive failures.
Source: J.D. Musa, A. Iannino, and K. Okumoto. Software reliability: measurement,
prediction, application. McGraw-Hill, New York, 1987; Table on
page 305.

This phenomenon is a consequence of the fact that one uses a symmetric kernel.
In that case, the resulting kernel density estimate will always be positive
on the interval $[x_i - h, x_i + h]$ for every element $x_i$ in the dataset. Hence,
observations close to zero will cause the kernel density estimate $f_{n,h}$ to be positive
to the left of zero. It is possible to improve the kernel density estimate in a
neighborhood of zero by means of a so-called boundary kernel. Without going
into detail about the construction of such an improvement, we will only show
the result of this. On the right in Figure 15.8 the histogram of the interfailure
times is plotted together with the kernel density estimate constructed with a
symmetric kernel (dotted line) and with the boundary kernel density estimate
(solid line). The boundary kernel density estimate is 0 to the left of the origin
and is adjusted on the interval [0, h). On the interval $[h, \infty)$ both kernel
density estimates are the same.

Fig. 15.8. Kernel density estimate of the software reliability data with symmetric
and boundary kernel.


15.4  The  empirical  distribution  function 

Another way to graphically represent a dataset is to plot the data in a cumulative
manner. This can be done using the empirical cumulative distribution
function of the data. It is denoted by $F_n$ and is defined at a point x as the
proportion of elements in the dataset that are less than or equal to x:

$$F_n(x) = \frac{\text{number of elements in the dataset} \le x}{n}.$$

To illustrate the construction of $F_n$, consider a dataset of five elements whose
three smallest values are 1, 3, and 4.
The corresponding empirical distribution function is displayed in Figure 15.9.
For x < 1, there are no elements less than or equal to x, so that $F_n(x) = 0$. For
$1 \le x < 3$, only the element 1 is less than or equal to x, so that $F_n(x) = 1/5$.
For $3 \le x < 4$, the elements 1 and 3 are less than or equal to x, so that
$F_n(x) = 2/5$, and so on.

In general, the graph of $F_n$ has the form of a staircase, with $F_n(x) = 0$ for all
x smaller than the minimum of the dataset and $F_n(x) = 1$ for all x greater
than the maximum of the dataset. Between the minimum and maximum, $F_n$
has a jump of size 1/n at each element of the dataset and is constant between
successive elements. In Figure 15.9, the marks • and ◦ are added to the graph
to emphasize the fact that, for instance, the value of $F_n(x)$ at x = 3 is 0.4, not
0.2. Usually, we leave these out, and one might also connect the horizontal
segments by vertical lines.
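
The staircase behavior is easy to reproduce. The following Python sketch defines the empirical distribution function by counting the elements less than or equal to x. The dataset in the code is an assumption: it contains the elements 1, 3, and 4 mentioned above, while the values 7 and 9 are made up for illustration.

```python
from bisect import bisect_right

def empirical_cdf(data):
    """Return F_n with F_n(x) = (number of elements <= x) / n."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x):
        # bisect_right counts the elements that are less than or equal to x.
        return bisect_right(ordered, x) / n

    return F

# Assumed five-element dataset; only 1, 3, and 4 are mentioned in the text.
F = empirical_cdf([4, 3, 9, 1, 7])
values = [F(x) for x in (0.5, 1, 2.9, 3, 4, 10)]
```

Evaluating at x = 3 gives 2/5 and not 1/5, because the element 3 itself is counted; this is the right-continuity that the marks in Figure 15.9 emphasize.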

Fig. 15.9. Empirical distribution function.

In Figure 15.10 the empirical distribution functions are plotted for the Old
Faithful data and the software reliability data. The fact that the Old Faithful
data accumulate in the neighborhood of 120 and 270 is reflected in the graph
of $F_n$ by the fact that it is steeper at these places: the jumps of $F_n$ succeed each
other faster. In regions where the elements of the dataset are more stretched
out, the graph of $F_n$ is flatter. Similar behavior can be seen for the software
reliability data in the neighborhood of zero. The elements accumulate more
close to zero, less as we move to the right. This is reflected by the empirical
distribution function, which is very steep near zero and flattens out if we move
to the right.

The graph of the empirical distribution function for the Old Faithful data
agrees with the histogram in Figure 15.1 whose height is the largest on the
bins (90, 120] and (240, 270]. In fact, there is a one-to-one relation between the
two graphical summaries of the data: the area under the histogram on a single
bin is equal to the relative frequency of elements that lie in that bin, which is
also equal to the increase of $F_n$ on that bin. For instance, the area under the
histogram on bin (240, 270] for the Old Faithful data is equal to 30 · 0.0092 =
0.276 (see Quick exercise 15.2). On the other hand, $F_n(270) = 215/272 =
0.7904$ and $F_n(240) = 140/272 = 0.5147$, whose difference $F_n(270) - F_n(240)$
is also equal to 0.276.

Fig. 15.10. Empirical distribution function of the Old Faithful data and the
software reliability data.
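
The relation between bin areas and increases of the empirical distribution function can be verified for all bins at once. The Python sketch below uses the bin counts listed in the solution to Quick exercise 15.2; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Bin counts for the Old Faithful data on (90,120], ..., (300,330],
# as listed in the solution to Quick exercise 15.2.
edges = [90, 120, 150, 180, 210, 240, 270, 300, 330]
counts = [55, 37, 5, 9, 34, 75, 54, 3]
n = sum(counts)                      # 272 observations
width = 30

heights = [c / (n * width) for c in counts]
areas = [h * width for h in heights]

# F_n at each bin boundary is the cumulative relative frequency up to that point;
# no observation is smaller than or equal to 90, so F_n(90) = 0.
F_at_edges = [0.0] + [c / n for c in accumulate(counts)]
increases = [F_at_edges[i + 1] - F_at_edges[i] for i in range(len(counts))]

i = edges.index(240)                 # position of the bin (240, 270]
print(f"area = {areas[i]:.4f}, increase of F_n = {increases[i]:.4f}")
```

The two lists `areas` and `increases` agree bin by bin (up to floating-point rounding), since both are simply the relative frequencies of the bins.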

Quick  exercise  15.6  Suppose  that  for  a  dataset  consisting  of  300  elements, 
the  value  of  the  empirical  distribution  function  in  the  point  1.5  is  equal  to 
0.7.  How  many  elements  in  the  dataset  are  strictly  greater  than  1.5? 

Remark 15.3 ($F_n$ as a discrete distribution function). Note that
$F_n$ satisfies the four properties of a distribution function: it is continuous
from the right, $F_n(x) \to 0$ as $x \to -\infty$, $F_n(x) \to 1$ as $x \to \infty$, and $F_n$ is
nondecreasing. This means that $F_n$ itself is a distribution function of some
random variable. Indeed, $F_n$ is the distribution function of the discrete random
variable that attains values $x_1, x_2, \ldots, x_n$ with equal probability 1/n.


15.5  Scatterplot 

In some situations one wants to investigate the relationship between two or
more variables. In the case of two variables x and y, the dataset consists of
pairs of observations:

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).$$

We call such a dataset a bivariate dataset in contrast to the univariate dataset,
which consists of observations of one particular quantity. We often like to
investigate whether the value of variable y depends on the value of the variable x,
and if so, whether we can describe the relation between the two variables. A
first step is to take a look at the data, i.e., to plot the points $(x_i, y_i)$ for
i = 1, 2, ..., n. Such a plot is called a scatterplot.

Drilling  in  rock 

During  a  study  about  “dry”  and  “wet”  drilling  in  rock,  six  holes  were  drilled, 
three  corresponding  to  each  process.  In  a  dry  hole  one  forces  compressed  air 
down  the  drill  rods  to  flush  the  cutting  and  the  drive  hammer,  whereas  in  a 
wet  hole  one  forces  water.  As  the  hole  gets  deeper,  one  has  to  add  a  rod  of 
5  feet  length  to  the  drill.  In  each  hole  the  time  was  recorded  to  advance  5 
feet  to  a  total  depth  of  400  feet.  The  data  in  Table  15.4  are  in  1/100  minute 
and  are  derived  from  the  original  data  in  [23].  The  original  data  consisted  of 
drill  times  for  each  of  the  six  holes  and  contained  missing  observations  and 
observations  that  were  known  to  be  too  large.  The  data  in  Table  15.4  are  the 
mean  drill  times  of  the  bona  fide  observations  at  each  depth  for  dry  and  wet 
drilling. 

One of the questions of interest is whether drill time depends on depth. To
investigate this, we plot the mean drill time against depth. Figure 15.11 displays
the resulting scatterplots for the dry and wet holes. The scatterplots seem to
indicate that in the beginning the drill time hardly depends on depth, at least
up to, let’s say, 250 feet. At greater depth, the drill time seems to vary over a
larger range and increases somewhat with depth. A possible explanation for
this is that the drill moved from softer to harder material. This was suggested
by the fact that the drill hit an ore lens at about 250 feet and that the natural
place such ore lenses occur is between two different materials (see [23] for
details).

Table 15.4. Mean drill time (in 1/100 minute) at each depth (in feet).

Depth      Dry      Wet    Depth      Dry      Wet
    5   640.67   830.00      205   803.33   962.33
   10   674.67   800.00      210   794.33   864.67
   15   708.00   711.33      215   760.67   805.67
   20   735.67   867.67      220   789.50   966.00
   25   754.33   940.67      225   904.50  1010.33
   30   723.33   941.33      230   940.50   936.33
   35   664.33   924.33      235   882.00   915.67
   40   727.67   873.00      240   783.50   956.33
   45   658.67   874.67      245   843.50   936.00
   50   658.00   843.33      250   813.50   803.67
   55   705.67   885.67      255   658.00   697.33
   60   700.00   881.67      260   702.50   795.67
   65   720.67   822.00      265   623.50  1045.33
   70   701.33   886.33      270   739.00  1029.67
   75   716.67   842.50      275   907.50   977.00
   80   649.67   874.67      280   846.00  1054.33
   85   667.33   889.33      285   829.00  1001.33
   90   612.67   870.67      290   975.50  1042.00
   95   656.67   916.00      295   998.00  1200.67
  100   614.00   888.33      300  1037.50  1172.67
  105   584.00   835.33      305   984.00  1019.67
  110   619.67   776.33      310   972.50   990.33
  115   666.00   811.67      315   834.00  1173.33
  120   695.00   874.67      320   675.00  1165.67
  125   702.00   846.00      325   686.00  1142.00
  130   739.67   920.67      330   963.00  1030.67
  135   790.67   896.33      335   961.50  1089.67
  140   730.33   810.33      340   932.00  1154.33
  145   674.00   912.33      345  1054.00  1238.50
  150   749.00   862.33      350  1038.00  1208.67
  155   709.67   828.00      355  1238.00  1134.67
  160   769.00   812.67      360   927.00  1088.00
  165   663.00   795.67      365   850.00  1004.00
  170   679.33   897.67      370  1066.00  1104.00
  175   740.67   881.00      375   962.50   970.33
  180   776.50   819.67      380  1025.50  1054.50
  185   688.00   853.33      385  1205.50  1143.50
  190   761.67   844.33      390  1168.00  1044.00
  195   800.00   919.00      395  1032.50   978.33
  200   845.50   933.33      400  1162.00  1104.00

Source: R. Penner and D.G. Watts. Mining information. The American
Statistician, 45:4-9, 1991; Table 1 on page 6.

Fig. 15.11. Scatterplots of mean drill time versus depth.

A  more  important  question  is  whether  one  can  drill  holes  faster  using  dry 
drilling  or  wet  drilling.  The  scatterplots  seem  to  suggest  that  dry  drilling 
might  be  faster.  We  will  come  back  to  this  later. 

Predicting  Janka  hardness  of  Australian  timber 

The  Janka  hardness  test  is  a  standard  test  to  measure  the  hardness  of  wood. 
It  measures  the  force  required  to  push  a  steel  ball  with  a  diameter  of  11.28 
millimeters  (0.444  inch)  into  the  wood  to  a  depth  of  half  the  ball’s  diameter. 
To  measure  Janka  hardness  directly  is  difficult.  However,  it  is  related  to  the 
density  of  the  wood,  which  is  comparatively  easy  to  measure.  In  Table  15.5 
a bivariate dataset is given of density (x) and Janka hardness (y) of 36 Australian
eucalypt hardwoods.

In order to get an impression of the relationship between hardness and density,
we made a scatterplot of the bivariate dataset, which is displayed in
Figure 15.12. It consists of all points $(x_i, y_i)$ for i = 1, 2, ..., 36. The scatterplot
might provide suggestions for the formula that describes the relationship
between the variables x and y. In this case, a linear relationship between the
two variables does not seem unreasonable. Later (Chapter 22) we will discuss
how one can establish such a linear relationship by means of the observed
pairs.

Table 15.5. Density and hardness of Australian timber.

Density  Hardness    Density  Hardness    Density  Hardness
   24.7       484       39.4      1210       53.4      1880
   24.8       427       39.9       989       56.0      1980
   27.3       413       40.3      1160       56.5      1820
   28.4       517       40.6      1010       57.3      2020
   28.4       549       40.7      1100       57.6      1980
   29.0       648       40.7      1130       59.2      2310
   30.3       587       42.9      1270       59.8      1940
   32.7       704       45.8      1180       66.0      3260
   35.6       979       46.9      1400       67.4      2700
   38.5       914       48.2      1760       68.8      2890
   38.8      1070       51.5      1710       69.1      2740
   39.3      1020       51.5      2010       69.1      3140

Source: E.J. Williams. Regression analysis. John Wiley & Sons Inc., New
York, 1959; Table 3.1 on page 43.

Fig. 15.12. Scatterplot of Janka hardness versus density of wood.

Quick exercise 15.7 Suppose we have a eucalypt hardwood tree with density
65. What would your prediction be for the corresponding Janka hardness?

15.6 Solutions to the quick exercises


15.1  There  are  272  elements  in  the  dataset.  The  91st  and  182nd  elements 
of  the  ordered  data  divide  the  dataset  in  three  groups,  each  consisting  of  90 
elements.  From  a  closer  look  at  Table  15.2  we  find  that  these  two  elements 
are  145  and  260. 


15.2 In Table 15.2 one can easily count the number of observations in each
of the bins (90, 120], ..., (300, 330]. The heights on each bin can be computed
by dividing the number of observations in each bin by 272 · 30 = 8160. We get
the following:

Bin         Count  Height    Bin         Count  Height
(90,120]       55  0.0067    (210,240]      34  0.0042
(120,150]      37  0.0045    (240,270]      75  0.0092
(150,180]       5  0.0006    (270,300]      54  0.0066
(180,210]       9  0.0011    (300,330]       3  0.0004

15.3 From Table 15.2 we see that we must cover an interval of length of at
least 306 - 96 = 210 with bins of width $b = 3.49 \cdot 68.48 \cdot 272^{-1/3} = 36.89$.
Since 210/36.89 = 5.69, we need at least six bins to cover the whole dataset.

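15.4 By means of formula (15.1), we can write

$$\int_{-\infty}^{\infty} f_{n,h}(t)\, dt = \frac{1}{nh} \sum_{i=1}^{n} \int_{-\infty}^{\infty} K\left( \frac{t - x_i}{h} \right) dt.$$

For any i = 1, ..., n, we find by change of integration variables $t = hu + x_i$
that

$$\int_{-\infty}^{\infty} K\left( \frac{t - x_i}{h} \right) dt = h \int_{-\infty}^{\infty} K(u)\, du = h,$$

where we also use condition (K1). This directly yields

$$\int_{-\infty}^{\infty} f_{n,h}(t)\, dt = \frac{1}{nh} \cdot n \cdot h = 1.$$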


15.5 The kernel density estimate will be strictly positive between the minimum
minus h and the maximum plus h. The bandwidth equals
$h = 1.06 \cdot 68.48 \cdot 272^{-1/5} = 23.66$. From Table 15.2, we see that this will be between
96 - 23.66 = 72.34 and 306 + 23.66 = 329.66.


15.6 By definition the number of elements less than or equal to 1.5 is
$F_{300}(1.5) \cdot 300 = 210$. Hence 90 elements are strictly greater than 1.5.

15.7  Just  by  drawing  a  straight  line  that  seems  to  fit  the  datapoints  well,  the 
authors  predicted  a  Janka  hardness  of  about  2700. 
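
A line drawn by eye is one possibility. As a cross-check, the following Python sketch fits a straight line to the 36 pairs of Table 15.5 by least squares and evaluates it at density 65. Least squares is swapped in here as one standard way to fit a line and is an assumption of this sketch, not the method used in the solution above; the function name is illustrative.

```python
# Density (x) and Janka hardness (y) of the 36 timbers in Table 15.5.
density = [
    24.7, 24.8, 27.3, 28.4, 28.4, 29.0, 30.3, 32.7, 35.6, 38.5, 38.8, 39.3,
    39.4, 39.9, 40.3, 40.6, 40.7, 40.7, 42.9, 45.8, 46.9, 48.2, 51.5, 51.5,
    53.4, 56.0, 56.5, 57.3, 57.6, 59.2, 59.8, 66.0, 67.4, 68.8, 69.1, 69.1,
]
hardness = [
    484, 427, 413, 517, 549, 648, 587, 704, 979, 914, 1070, 1020,
    1210, 989, 1160, 1010, 1100, 1130, 1270, 1180, 1400, 1760, 1710, 2010,
    1880, 1980, 1820, 2020, 1980, 2310, 1940, 3260, 2700, 2890, 2740, 3140,
]

def least_squares_line(x, y):
    """Intercept and slope of the line minimizing the sum of squared vertical distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = least_squares_line(density, hardness)
prediction = intercept + slope * 65
print(f"hardness = {intercept:.1f} + {slope:.2f} * density; at 65: {prediction:.0f}")
```

A fitted line and a line drawn by eye will generally differ somewhat, so the number printed here need not coincide with the value of about 2700 quoted above.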
15.7 Exercises


15.1 In [33] Stephen Stigler discusses data from the Edinburgh Medical and
Surgical Journal (1817). These concern the chest circumference of 5732 Scottish
soldiers, measured in inches. The following information is given about the
histogram with bin width 1, the first bin starting at 32.5.

Bin            Count    Bin            Count
(32.5, 33.5]       3    (40.5, 41.5]     935
(33.5, 34.5]      19    (41.5, 42.5]     646
(34.5, 35.5]      81    (42.5, 43.5]     313
(35.5, 36.5]     189    (43.5, 44.5]     168
(36.5, 37.5]     409    (44.5, 45.5]      50
(37.5, 38.5]     753    (45.5, 46.5]      18
(38.5, 39.5]    1062    (46.5, 47.5]       3
(39.5, 40.5]    1082    (47.5, 48.5]       1

Source: S.M. Stigler. The history of statistics - The measurement of uncertainty
before 1900. Cambridge, Massachusetts, 1986.


a.  Compute  the  height  of  the  histogram  on  each  bin. 

b.  Make  a  sketch  of  the  histogram.  Would  you  view  the  dataset  as  being 
symmetric  or  skewed? 

15.2  Recall  the  example  of  the  space  shuttle  Challenger  in  Section  1.4.  The 
following  list  contains  the  launch  temperatures  in  degrees  Fahrenheit  during 
previous  takeoffs. 

66  70  69  68  67  72  73  70  57  63  70  78 

67  53  67  75  70  81  76  79  75  76  58 

Source: Presidential commission on the space shuttle Challenger accident.
Report on the space shuttle Challenger accident. Washington, DC, 1986; table
on pages 129-131.

a. Compute the heights of a histogram with bin width 5, the first bin starting
at 50.

b. On January 28, 1986, during the launch of the space shuttle Challenger,
the temperature was 31 degrees Fahrenheit. Given the dataset of launch
temperatures of previous takeoffs, would you consider 31 as a representative
launch temperature?

15.3 □ In an article in Biometrika, an example is discussed about mine disasters
during the period from March 15, 1851, to March 22, 1962. A dataset
has been obtained of 190 recorded time intervals (in days) between successive
coal mine disasters involving ten or more men killed. The ordered data are
listed in Table 15.6.

Table 15.6. Number of days between successive coal mine disasters.

    0     1     1     2     2     3     4     4     4     6
    7    10    11    12    12    12    13    15    15    16
   16    16    17    17    18    19    19    19    20    20
   22    23    24    25    27    28    29    29    29    31
   31    32    33    34    34    36    36    37    40    41
   41    42    43    45    47    48    49    50    53    54
   54    55    56    59    59    61    61    65    66    66
   70    72    75    78    78    78    80    80    81    88
   91    92    93    93    95    95    96    96    97    99
  101   108   110   112   113   114   120   120   123   123
  124   124   125   127   129   131   134   137   139   143
  144   145   151   154   156   157   176   182   186   187
  188   189   190   193   194   197   202   203   208   215
  216   217   217   217   218   224   225   228   232   233
  250   255   275   275   275   276   286   292   307   307
  312   312   315   324   326   326   329   330   336   345
  348   354   361   364   368   378   388   420   431   456
  462   467   498   517   536   538   566   632   644   745
  806   826   871   952  1205  1312  1358  1630  1643  2366

Source: R.G. Jarrett. A note on the intervals between coal mining disasters.
Biometrika, 66:191-193, 1979; by permission of the Biometrika Trustees.

a. Compute the height on each bin of the histogram with bins [0, 250],
(250, 500], ..., (2250, 2500].

b.  Make  a  sketch  of  the  histogram.  Would  you  view  the  dataset  as  being 
symmetric  or  skewed? 

15.4  □  The  ordered  software  data  (see  also  Table  15.3)  are  given  in  the  fol¬ 
lowing  list. 


0 

0 

0 

2 

4 

6 

8 

9 

10 

10 

10 

12 

15 

15 

16 

21 

22 

24 

26 

30 

30 

31 

33 

36 

44 

50 

55 

58 

65 

68 

75 

77 

79 

81 

88 

91 

97 

100 

108 

108 

112 

113 

114 

115 

120 

122 

129 

134 

138 

143 

148 

160 

176 

180 

193 

193 

197 

227 

232 

233 

236 

242 

245 

255 

261 

263 

281 

290 

296 

300 

300 

325 

330 

357 

365 

369 

371 

379 

386 

422 

445 

446 

447 

452 

457 

482 

529 

529 

543 

600 

648 

670 

700 

707 

724 

729 

748 

790 

810 

816 

828 

843 

860 

865 

868 

875 

943 

948 

983 

990 

1011 

1045 

1064 

1071 

1082 

1146 

1160 

1222 

1247 

1351 

1435 

1461 

1755 

1783 

1800 

1864 

1897 

2323 

2930 

3110 

3321 

4116 

5485 

5509 

6150 
a. Compute the height of the histogram on each of the bins [0, 500],
(500, 1000], and so on.

b.  Compute  the  value  of  the  empirical  distribution  function  in  the  endpoints 
of  the  bins. 

c. Check that the area under the histogram on bin (1000, 1500] is equal to
the increase $F_n(1500) - F_n(1000)$ of the empirical distribution function
on this bin. Actually, this is true for each single bin (see Exercise 15.11).

15.5  □  Suppose  we  construct  a  histogram  with  bins  [0,1],  (1,3],  (3,5],  (5,8], 
(8,11],  (11,14],  and  (14,18].  Given  are  the  values  of  the  empirical  distribution 
function  at  the  boundaries  of  the  bins: 

t         0    1      3      5      8      11     14     18
$F_n(t)$  0    0.225  0.445  0.615  0.735  0.805  0.910  1.000

Compute  the  height  of  the  histogram  on  each  bin. 

15.6 ⊞ Given is the following information about a histogram:

Bin       Height
(0,2]     0.245
(2,4]     0.130
(4,7]     0.050
(7,11]    0.020
(11,15]   0.005

Compute  the  value  of  the  empirical  distribution  function  in  the  point  t  =  7. 

15.7  In  Exercise  15.2  a  histogram  was  constructed  for  the  Challenger  data.  On 
which  bin  does  the  empirical  distribution  function  have  the  largest  increase? 

15.8  Define  a  function  K  by 

\[ K(u) = \cos(\pi u) \quad \text{for } -1 < u < 1 \]

and $K(u) = 0$ elsewhere. Check whether K satisfies the conditions (K1)-(K3)
for  a  kernel  function. 

15.9  On  the  basis  of  the  duration  of  an  eruption  of  the  Old  Faithful  geyser, 
park rangers try to predict the waiting time to the next eruption. In Figure 15.13
a scatterplot is displayed of the duration and the time to the next
eruption  in  seconds. 

a.  Does  the  scatterplot  give  reason  to  believe  that  the  duration  of  an  eruption 
influences  the  time  to  the  next  eruption? 
Fig. 15.13. Scatterplot of the Old Faithful data.


b.  Suppose  you  have  just  observed  an  eruption  that  lasted  250  seconds.  What 
would  you  predict  for  the  time  to  the  next  eruption? 

c.  The  dataset  of  durations  shows  two  modes,  i.e.,  there  are  two  places  where 
the  data  accumulate  (see,  for  instance,  the  histogram  in  Figure  15.1).  How 
many  modes  does  the  dataset  of  waiting  times  show? 

15.10  Figure  15.14  displays  the  graph  of  an  empirical  distribution  function 
of  a  dataset  consisting  of  200  elements.  How  many  modes  does  the  dataset 
show? 


Fig.  15.14.  Empirical  distribution  function. 


15.11 ⊞ Given is a histogram and the empirical distribution function of
the same dataset. Show that the height of the histogram on a bin (a, b] is
equal to

\[ \frac{F_n(b) - F_n(a)}{b - a}. \]

15.12 ⊞ Let $f_{n,h}$ be a kernel estimate. As mentioned in Section 15.3, $f_{n,h}$
itself is a probability density.

a. Show that the corresponding expectation is equal to

\[ \int_{-\infty}^{\infty} t f_{n,h}(t)\,dt = \bar{x}_n. \]

Hint: you might consult the solution to Quick exercise 15.4.

b. Show that the second moment corresponding to $f_{n,h}$ satisfies

\[ \int_{-\infty}^{\infty} t^2 f_{n,h}(t)\,dt = \frac{1}{n}\sum_{i=1}^{n} x_i^2 + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du. \]
16 Exploratory data analysis: numerical summaries


The  classical  way  to  describe  important  features  of  a  dataset  is  to  give  several 
numerical  summaries.  We  discuss  numerical  summaries  for  the  center  of  a 
dataset  and  for  the  amount  of  variability  among  the  elements  of  a  dataset,  and 
then  we  introduce  the  notion  of  quantiles  for  a  dataset.  To  distinguish  these 
quantities  from  corresponding  notions  for  probability  distributions  of  random 
variables,  we  will  often  add  the  word  sample  or  empirical;  for  instance,  we  will 
speak  of  the  sample  mean  and  empirical  quantiles.  We  end  this  chapter  with 
the  boxplot,  which  combines  some  of  the  numerical  summaries  in  a  graphical 
display. 


16.1  The  center  of  a  dataset 


The best-known method to identify the center of a dataset is to compute the
sample mean

\[ \bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}. \qquad (16.1) \]

For the sake of notational convenience we will sometimes drop the subscript n
and write $\bar{x}$ instead of $\bar{x}_n$. The following dataset consists of hourly
temperatures in degrees Fahrenheit (rounded to the nearest integer), recorded at Wick
in northern Scotland from 5 p.m. December 31, 1960, to 3 a.m. January 1,
1961. The sample mean of the 11 measurements is equal to 44.7.


43  43  41  41  41  42  43  58  58  41  41 

Source: V. Barnett and T. Lewis. Outliers in statistical data. Third edition,
1994. © John Wiley & Sons Limited. Reproduced with permission.

Another way to identify the center of a dataset is by means of the sample
median, which we will denote by $\mathrm{Med}(x_1, x_2, \ldots, x_n)$ or briefly $\mathrm{Med}_n$. The
sample median is defined as the middle element of the dataset when it is put
in ascending order. When n is odd, it is clear what this means. When n is even,
we take the average of the two middle elements. For the Wick temperature
data the sample median is equal to 42.

Quick  exercise  16.1  Compute  the  sample  mean  and  sample  median  of  the 
dataset 


4.6  3.0  3.2  4.2  5.0. 

Both  methods  have  pros  and  cons.  The  sample  mean  is  the  natural  analogue 
for  a  dataset  of  what  the  expectation  is  for  a  probability  distribution.  However, 
it  is  very  sensitive  to  outliers,  by  which  we  mean  observations  in  the  dataset 
that  deviate  a  lot  from  the  bulk  of  the  data. 

To illustrate the sensitivity of the sample mean, consider the Wick temperature
data displayed in Figure 16.1. The values 58 and 58 recorded at midnight
and  1  a.m.  are  clearly  far  from  the  bulk  of  the  data  and  give  grounds  for 
concern  whether  they  are  genuine  (58  degrees  Fahrenheit  seems  very  warm 
at  midnight  for  New  Year’s  in  northern  Scotland).  To  investigate  their  effect 
on  the  sample  mean  we  compute  the  average  of  the  data,  leaving  out  these 
measurements,  which  gives  41.8  (instead  of  44.7).  The  sample  median  of  the 
data  is  equal  to  41  (instead  of  42)  when  leaving  out  the  measurements  with 
value  58.  The  median  is  more  robust  in  the  sense  that  it  is  hardly  affected  by 
a  few  outliers. 
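The effect described above is easy to reproduce. The following Python sketch (the variable names are ours, not the book's) computes the sample mean and sample median of the Wick data with and without the two measurements of 58, using the standard library's `statistics` module.

```python
import statistics

# Hourly temperatures at Wick (degrees Fahrenheit), as listed in Section 16.1.
wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

# Leave out the two suspicious measurements, for illustration only.
without_58 = [x for x in wick if x != 58]

for label, data in [("all data", wick), ("without 58s", without_58)]:
    mean = statistics.mean(data)      # (x_1 + ... + x_n) / n
    median = statistics.median(data)  # middle element of the ordered data
    print(label, round(mean, 1), median)
```

The mean moves from 44.7 to 41.8, while the median only moves from 42 to 41.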
Fig. 16.1. The Wick temperature data.
It should be emphasized that this discussion is only meant to illustrate the
sensitivity of the sample mean and by no means is intended to suggest we leave
out measurements that deviate a lot from the bulk of the data! It is important
to be aware of the presence of an outlier. In that case, one could try to find out
whether there is perhaps something suspicious about this measurement. This
might lead to assigning a smaller weight to such a measurement or even to
removing it from the dataset. However, sometimes it is possible to reconstruct
the exact circumstances and correct the measurement. For instance, after
further inquiry in the temperature example it turned out that at midnight
the meteorological office changed its recording unit from degrees Fahrenheit
to 1/10th degree Celsius (so 58 and 41 should read 5.8°C and 4.1°C). The
corrected values in degrees Fahrenheit (to the nearest integer) are

43 43 41 41 41 42 43 42 42 39 39.

For  the  corrected  data  the  sample  mean  is  41.5  and  the  sample  median  is  42. 
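The correction can be retraced in a few lines. This is a minimal Python sketch, assuming (as the text states) that the last four readings, from midnight to 3 a.m., were recorded in tenths of a degree Celsius; the helper name `tenths_celsius_to_fahrenheit` is ours.

```python
import statistics

# Readings as originally recorded, 5 p.m. through 3 a.m.
wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

def tenths_celsius_to_fahrenheit(reading):
    """Read a value as 1/10 degree Celsius and convert to whole degrees Fahrenheit."""
    celsius = reading / 10
    return round(celsius * 9 / 5 + 32)

# The first 7 readings (5 p.m. to 11 p.m.) are already in Fahrenheit;
# the last 4 (midnight to 3 a.m.) are converted.
corrected = wick[:7] + [tenths_celsius_to_fahrenheit(r) for r in wick[7:]]

print(corrected)
print(round(statistics.mean(corrected), 1), statistics.median(corrected))
```

The converted readings are 42, 42, 39, 39, and the summary line reproduces the sample mean 41.5 and sample median 42 quoted above.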

Quick  exercise  16.2  Consider  the  same  dataset  as  in  Quick  exercise  16.1. 
Suppose  that  someone  misreads  the  dataset  as 

4.6  30  3.2  4.2  50. 

Compute  the  sample  mean  and  sample  median  and  compare  these  values  with 
the  ones  you  found  in  Quick  exercise  16.1. 


16.2  The  amount  of  variability  of  a  dataset 


To quantify the amount of variability among the elements of a dataset, one
often uses the sample variance defined by

\[ s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2. \]

Up to a scaling factor this is equal to the average squared deviation from $\bar{x}_n$.
At first sight, it seems more natural to define the sample variance by

\[ \tilde{s}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2. \]

Why we choose the factor 1/(n − 1) instead of 1/n will be explained later (see
Chapter 19). Because $s_n^2$ is in different units from the elements of the dataset,
one often prefers the sample standard deviation

\[ s_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2}, \]

which is measured in the same units as the elements of the dataset itself.
Just as the sample mean, the sample standard deviation is very sensitive to
outliers. For the (uncorrected) Wick temperature data the sample standard
deviation is 6.62, or 0.97 if we leave out the two measurements with value 58.
For the corrected data the standard deviation is 1.44. A more robust measure
of variability is the median of absolute deviations or MAD, which is defined
as follows. Consider the absolute deviation of every element $x_i$ with respect
to the sample median:

\[ |x_i - \mathrm{Med}(x_1, x_2, \ldots, x_n)| \]

or briefly

\[ |x_i - \mathrm{Med}_n|. \]

The MAD is obtained by taking the median of all these absolute deviations:

\[ \mathrm{MAD}(x_1, x_2, \ldots, x_n) = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|). \qquad (16.2) \]

Quick exercise 16.3 Compute the sample standard deviation for the dataset
of Quick exercise 16.1 for which it is given that the values of $x_i - \bar{x}_n$ are:

-1.0,  0.6,  -0.8,  0.2,  1.0. 

Also  compute  the  MAD  for  this  dataset. 

Just  as  the  sample  median,  the  MAD  is  hardly  affected  by  outliers.  For  the 
(uncorrected)  Wick  temperature  data  the  MAD  is  1  and  equal  to  0  if  we  leave 
out  the  two  measurements  with  value  58  (the  value  0  seems  a  bit  strange, 
but  is  a  consequence  of  the  fact  that  the  observations  are  given  in  degrees 
Fahrenheit  rounded  to  the  nearest  integer).  For  the  corrected  data  the  MAD 
is  1. 
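The values quoted in this section can be checked with a short Python sketch. The function names `sample_sd` and `mad` are ours; `sample_sd` uses the factor 1/(n − 1) from the definition above, and `mad` follows (16.2).

```python
import math
import statistics

def sample_sd(data):
    """Sample standard deviation with the factor 1/(n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def mad(data):
    """Median of the absolute deviations from the sample median, as in (16.2)."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
without_58 = [x for x in wick if x != 58]
corrected = [43, 43, 41, 41, 41, 42, 43, 42, 42, 39, 39]

for label, data in [("uncorrected", wick),
                    ("without 58s", without_58),
                    ("corrected", corrected)]:
    print(label, round(sample_sd(data), 2), mad(data))
```

The standard deviation ranges over 6.62, 0.97, and 1.44 for the three versions of the data, whereas the MAD stays at 1, 0, and 1.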

Quick exercise 16.4 Compute the sample standard deviation for the misread
dataset of Quick exercise 16.2 for which it is given that the values of
$x_i - \bar{x}_n$ are:

11.6,  -13.8,  -15.2,  -14.2,  31.6. 

Also  compute  the  MAD  for  this  dataset  and  compare  both  values  with  the 
ones  you  found  in  Quick  exercise  16.3. 


16.3  Empirical  quantiles,  quartiles,  and  the  IQR 

The  sample  median  divides  the  dataset  in  two  more  or  less  equal  parts:  about 
half  of  the  elements  are  less  than  the  median  and  about  half  of  the  elements 
are  greater  than  the  median.  More  generally,  we  can  divide  the  dataset  in 
two  parts  in  such  a  way  that  a  proportion  p  is  less  than  a  certain  number 
and  a  proportion  1  —  p  is  greater  than  this  number.  Such  a  number  is  called 
the 100p empirical percentile or the pth empirical quantile and is denoted by
$q_n(p)$. For a suitable introduction of empirical quantiles we need the notion
of  order  statistics. 
The order statistics consist of the same elements as in the original dataset
$x_1, x_2, \ldots, x_n$, but in ascending order. Denote by $x_{(k)}$ the kth element in the
ordered list. Then

\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} \]

are called the order statistics of $x_1, x_2, \ldots, x_n$. The order statistics of the Wick
temperature data are

41 41 41 41 41 42 43 43 43 58 58.

Note that by putting the elements in order, it is possible that successive order
statistics are the same, for instance, $x_{(1)} = \cdots = x_{(5)} = 41$. Another example
is  Table  15.2,  which  lists  the  order  statistics  of  the  Old  Faithful  dataset. 

To compute empirical quantiles one linearly interpolates between order statistics
of the dataset. Let 0 < p < 1, and suppose we want to compute the pth
empirical quantile for a dataset $x_1, x_2, \ldots, x_n$. The following computation is
based on requiring that the ith order statistic is the i/(n + 1) quantile. If we
denote the integer part of a by $\lfloor a \rfloor$, then the computation of $q_n(p)$ runs as
follows:

\[ q_n(p) = x_{(k)} + \alpha \, (x_{(k+1)} - x_{(k)}) \]

with $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$. On the left in Figure 16.2 the
relation  between  the  pth  quantile  and  the  empirical  distribution  function  is 
illustrated  for  the  Old  Faithful  data. 
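The interpolation rule translates directly into code. The following is a minimal Python sketch; the function name `empirical_quantile` and the range check on p are our own additions (the check is an assumption about how to treat values of p for which the required order statistics do not exist, which the text does not discuss).

```python
import math

def empirical_quantile(data, p):
    """Return q_n(p) = x_(k) + alpha * (x_(k+1) - x_(k)),
    with k = floor(p * (n + 1)) and alpha = p * (n + 1) - k."""
    ordered = sorted(data)          # order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)
    position = p * (n + 1)
    k = math.floor(position)
    alpha = position - k
    # Assumption: x_(k) must exist, and x_(k+1) too whenever alpha > 0.
    if k < 1 or k > n or (k == n and alpha > 0):
        raise ValueError("need 1/(n+1) <= p <= n/(n+1)")
    if alpha == 0:
        return ordered[k - 1]       # lists are 0-based: x_(k) is ordered[k - 1]
    return ordered[k - 1] + alpha * (ordered[k] - ordered[k - 1])

wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
print(round(empirical_quantile(wick, 0.45), 2))  # interpolates between x_(5) and x_(6)
print(round(empirical_quantile(wick, 0.80), 2))  # interpolates between x_(9) and x_(10)
```

For the Wick data this gives 41.4 and 52.0. The second value shows that a quantile high enough to interpolate toward the two measurements of 58 is pulled up by them.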


Fig. 16.2. Empirical quantile and quartiles for the Old Faithful data.


Quick  exercise  16.5  Compute  the  55th  empirical  percentile  for  the  Wick 
temperature  data. 
Lower and upper quartiles

Instead  of  identifying  only  the  center  of  the  dataset,  Tukey  [35]  suggested 
to  give  a  five-number  summary  of  the  dataset:  the  minimum,  the  maximum, 
the  sample  median,  and  the  25th  and  75th  empirical  percentiles.  The  25th 
empirical percentile $q_n(0.25)$ is called the lower quartile and the 75th empirical
percentile $q_n(0.75)$ is called the upper quartile. Together with the median, the
lower  and  upper  quartiles  divide  the  dataset  in  four  more  or  less  equal  parts 
consisting  of  about  one  quarter  of  the  number  of  elements.  The  relation  of 
the  two  quartiles  and  the  median  with  the  empirical  distribution  function  is 
illustrated  for  the  Old  Faithful  data  on  the  right  of  Figure  16.2.  The  distance 
between  the  lower  quartile  and  the  median,  relative  to  the  distance  between 
the  upper  quartile  and  the  median,  gives  some  indication  on  the  skewness  of 
the  dataset.  The  distance  between  the  upper  and  lower  quartiles  is  called  the 
interquartile  range,  or  IQR: 

\[ \mathrm{IQR} = q_n(0.75) - q_n(0.25). \]

The  IQR  specifies  the  range  of  the  middle  half  of  the  dataset.  It  could  also 
serve  as  a  robust  measure  of  the  amount  of  variability  among  the  elements  of 
the  dataset.  For  the  Old  Faithful  data  the  five-number  summary  is 

Minimum  Lower  quartile  Median  Upper  quartile  Maximum 
96  129.25  240  267.75  306 

and  the  IQR  is  138.5. 
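As an illustration of how these five numbers are obtained, the following Python sketch computes the five-number summary and the IQR for the corrected Wick temperature data of Section 16.1 (the Old Faithful data are not repeated in this chapter). The function names are ours, and the quantile helper is a sketch of the interpolation rule given earlier in this section.

```python
import math

def empirical_quantile(data, p):
    """q_n(p) with k = floor(p(n+1)) and alpha = p(n+1) - k."""
    ordered = sorted(data)
    n = len(ordered)
    position = p * (n + 1)
    k = math.floor(position)
    alpha = position - k
    # Assumption: p is such that the needed order statistics exist.
    if k < 1 or k > n or (k == n and alpha > 0):
        raise ValueError("need 1/(n+1) <= p <= n/(n+1)")
    if alpha == 0:
        return ordered[k - 1]
    return ordered[k - 1] + alpha * (ordered[k] - ordered[k - 1])

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    return (min(data),
            empirical_quantile(data, 0.25),
            empirical_quantile(data, 0.50),
            empirical_quantile(data, 0.75),
            max(data))

corrected = [43, 43, 41, 41, 41, 42, 43, 42, 42, 39, 39]
summary = five_number_summary(corrected)
iqr = summary[3] - summary[1]
print(summary, iqr)
```

With n = 11 the quartile positions p(n + 1) are the whole numbers 3, 6, and 9, so no interpolation is needed: the summary is 39, 41, 42, 43, 43 and the IQR is 2.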

Quick exercise 16.6 Compute the five-number summary for the (uncorrected)
Wick temperature data.


16.4 The box-and-whisker plot

Tukey  [35]  also  proposed  visualizing  the  five-number  summary  discussed  in 
the previous section by a so-called box-and-whisker plot, briefly boxplot. Figure 16.3
displays a boxplot. The data are now on the vertical axis, where we
left  out  the  numbers  on  the  axis  in  order  to  explain  the  construction  of  the 
figure.  The  horizontal  width  of  the  box  is  irrelevant.  In  the  vertical  direction 
the  box  extends  from  the  lower  to  the  upper  quartile,  so  that  the  height  of  the 
box  is  precisely  the  IQR.  The  horizontal  line  inside  the  box  corresponds  to  the 
sample  median.  Up  from  the  upper  quartile  we  measure  out  a  distance  of  1.5 
times  the  IQR  and  draw  a  so-called  whisker  up  to  the  largest  observation  that 
lies  within  this  distance,  where  we  put  a  horizontal  line.  Similarly,  down  from 
the  lower  quartile  we  measure  out  a  distance  of  1.5  times  the  IQR  and  draw 
a  whisker  to  the  smallest  observation  that  lies  within  this  distance,  where 
we  also  put  a  horizontal  line.  All  other  observations  beyond  the  whiskers  are 
marked  by  o.  Such  an  observation  is  called  an  outlier. 
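The construction just described can be sketched in Python. The dataset below is made up purely for illustration, the function names are ours, and we assume that an observation lying exactly at a distance of 1.5 times the IQR still counts as within that distance.

```python
import math

def empirical_quantile(data, p):
    """q_n(p) with k = floor(p(n+1)) and alpha = p(n+1) - k (Section 16.3)."""
    ordered = sorted(data)
    n = len(ordered)
    position = p * (n + 1)
    k = math.floor(position)
    alpha = position - k
    if k < 1 or k > n or (k == n and alpha > 0):
        raise ValueError("need 1/(n+1) <= p <= n/(n+1)")
    if alpha == 0:
        return ordered[k - 1]
    return ordered[k - 1] + alpha * (ordered[k] - ordered[k - 1])

def boxplot_stats(data):
    """Box edges, whisker ends, and outliers of a box-and-whisker plot."""
    lower_q = empirical_quantile(data, 0.25)
    upper_q = empirical_quantile(data, 0.75)
    iqr = upper_q - lower_q
    low_limit = lower_q - 1.5 * iqr    # farthest point a lower whisker may reach
    high_limit = upper_q + 1.5 * iqr   # farthest point an upper whisker may reach
    inside = [x for x in data if low_limit <= x <= high_limit]
    return {
        "box": (lower_q, upper_q),
        "whisker_ends": (min(inside), max(inside)),
        "outliers": sorted(x for x in data if x < low_limit or x > high_limit),
    }

# Hypothetical data, not taken from the book.
example = [2, 4, 5, 5, 6, 7, 7, 8, 9, 10, 25]
print(boxplot_stats(example))
```

Here the box runs from 5 to 9, so the IQR is 4 and the whiskers may reach at most from −1 to 15. They end at the observations 2 and 10, and the value 25 is marked as an outlier.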


In Figure 16.4 the boxplots of the Old Faithful data and of the software reliability
data (see also Chapter 15) are displayed. The skewness of the software
reliability  data  produces  a  boxplot  with  whiskers  of  very  different  length  and 
with  several  observations  beyond  the  upper  quartile  plus  1.5  times  the  IQR. 
The  boxplot  of  the  Old  Faithful  data  illustrates  one  of  the  shortcomings  of  the 
boxplot;  it  does  not  capture  the  fact  that  the  data  show  two  separate  peaks. 
However,  the  position  of  the  sample  median  inside  the  box  does  suggest  that 
the  dataset  is  skewed. 

Quick exercise 16.7 Suppose we want to construct a boxplot of the (uncorrected)
Wick temperature data. What is the height of the box, the length of
both  whiskers,  and  which  measurements  fall  outside  the  box  and  whiskers? 
Would  you  consider  the  two  values  58  extreme  outliers? 
Fig. 16.4. Boxplot of the Old Faithful data and the software data.

Using boxplots to compare several datasets

Although  the  boxplot  provides  some  information  about  the  structure  of  the 
data,  such  as  center,  range,  skewness  or  symmetry,  it  is  a  poor  graphical 
display  of  the  dataset.  Graphical  summaries  such  as  the  histogram  and  kernel 
density  estimate  are  more  informative  displays  of  a  single  dataset.  Boxplots 
become  useful  if  we  want  to  compare  several  sets  of  data  in  a  simple  graphical 
display.  In  Figure  16.5  boxplots  are  displayed  of  the  average  drill  time  for 
dry  and  wet  drilling  up  to  a  depth  of  250  feet  for  the  drill  data  discussed  in 
Section  15.5  (see  also  Table  15.4).  It  is  clear  that  the  boxplot  corresponding 
to  dry  drilling  differs  from  that  corresponding  to  wet  drilling.  However,  the 
question  is  whether  this  difference  can  still  be  attributed  to  chance  or  is  caused 
by  the  drilling  technique  used.  We  will  return  to  this  type  of  question  in 
Chapter  25. 


Fig.  16.5.  Boxplot  of  average  drill  times. 


16.5  Solutions  to  the  quick  exercises 


16.1 The average is

\[ \bar{x}_n = \frac{4.6 + 3.0 + 3.2 + 4.2 + 5.0}{5} = \frac{20}{5} = 4. \]

The median is the middle element of 3.0, 3.2, 4.2, 4.6, and 5.0, which gives
$\mathrm{Med}_n = 4.2$.

16.2 The average is

\[ \bar{x}_n = \frac{4.6 + 30 + 3.2 + 4.2 + 50}{5} = \frac{92}{5} = 18.4, \]

which differs 14.4 from the average we found in Quick exercise 16.1. The
median is the middle element of 3.2, 4.2, 4.6, 30, and 50. This gives $\mathrm{Med}_n = 4.6$,
which only differs 0.4 from the median we found in Quick exercise 16.1.
As one can see, the median is hardly affected by the two outliers.

16.3 The sample variance is

\[ s_n^2 = \frac{(-1)^2 + (0.6)^2 + (-0.8)^2 + (0.2)^2 + (1.0)^2}{4} = \frac{3.04}{4} = 0.76, \]

so that the sample standard deviation is $s_n = \sqrt{0.76} = 0.872$. The median is
4.2,  so  that  the  absolute  deviations  from  the  median  are  given  by 

0.4  1.2  1.0  0.0  0.8. 

The  MAD  is  the  median  of  these  numbers,  which  is  0.8. 

16.4 The sample variance is

\[ s_n^2 = \frac{(11.6)^2 + (-13.8)^2 + (-15.2)^2 + (-14.2)^2 + (31.6)^2}{4} = \frac{1756.24}{4} = 439.06, \]

so that the sample standard deviation is $s_n = \sqrt{439.06} = 20.95$, which is a
difference of 20.08 from the value we found in Quick exercise 16.3. The median
is 4.6, so that the absolute deviations from the median are given by

0.0 25.4 1.4 0.4 45.4.

The MAD is the median of these numbers, which is 1.4. Just as the median,
the MAD is hardly affected by the two outliers.

16.5 We have $k = \lfloor 0.55 \cdot 12 \rfloor = \lfloor 6.6 \rfloor = 6$, so that $\alpha = 0.6$. This gives

\[ q_n(0.55) = x_{(6)} + 0.6 \cdot (x_{(7)} - x_{(6)}) = 42 + 0.6 \cdot (43 - 42) = 42.6. \]

16.6 From the order statistics of the Wick temperature data

41 41 41 41 41 42 43 43 43 58 58

it can be seen immediately that minimum, maximum, and median are given by
41, 58, and 42. For the lower quartile we have $k = \lfloor 0.25 \cdot 12 \rfloor = 3$, so that $\alpha = 0$
and $q_n(0.25) = x_{(3)} = 41$. For the upper quartile we have $k = \lfloor 0.75 \cdot 12 \rfloor = 9$,
so that again $\alpha = 0$ and $q_n(0.75) = x_{(9)} = 43$. Hence for the Wick temperature
data the five-number summary is

Minimum  Lower quartile  Median  Upper quartile  Maximum
41       41              42      43              58

16.7 From the five-number summary for the Wick temperature data (see
Quick  exercise  16.6),  it  follows  immediately  that  the  height  of  the  box  is  the 
IQR:  43  —  41  =  2.  If  we  measure  out  a  distance  of  1.5  times  2  down  from  the 
lower  quartile  41,  we  see  that  the  smallest  observation  within  this  range  is 
41,  which  means  that  the  lower  whisker  has  length  zero.  Similarly,  the  upper 
whisker  has  length  zero.  The  two  measurements  with  value  58  are  outside  the 
box  and  whiskers.  The  two  values  58  are  clearly  far  away  from  the  bulk  of  the 
data  and  should  be  considered  extreme  outliers. 


16.6  Exercises 

16.1  □  Use  the  order  statistics  of  the  software  data  as  given  in  Exercise  15.4 
to  answer  the  following  questions. 

a.  Compute  the  sample  median. 

b.  Compute  the  lower  and  upper  quartiles  and  the  IQR. 

c.  Compute  the  37th  empirical  percentile. 

16.2  Compute  for  the  Old  Faithful  data  the  distance  of  the  lower  and  upper 
quartiles  to  the  median  and  explain  the  difference. 

16.3 ⊞ Recall the example about the space shuttle Challenger in Section 1.4.
The following table lists the order statistics of launch temperatures during
take-offs in degrees Fahrenheit, including the launch temperature on
January 28, 1986.

31  53  57  58  63  66  67  67  67  68  69  70 

70  70  70  72  73  75  75  76  76  78  79  81 

a.  Find  the  sample  median  and  the  lower  and  upper  quartiles. 

b.  Sketch  the  boxplot  of  this  dataset. 
c. On January 28, 1986, the launch temperature was 31 degrees Fahrenheit.
Comment  on  the  value  31  with  respect  to  the  other  data  points. 

16.4 □ The sample mean and sample median of the uncorrected Wick temperature
data (in degrees Fahrenheit) are 44.7 and 42. We transform the data
from degrees Fahrenheit ($x_i$) to degrees Celsius ($y_i$) by means of the formula

\[ y_i = \frac{5}{9}(x_i - 32), \]

which gives the following dataset

55/9  55/9  5  5  5  50/9  55/9  130/9  130/9  5  5.

a. Check that $\bar{y}_n = \frac{5}{9}(\bar{x}_n - 32)$.

b. Is it also true that $\mathrm{Med}(y_1, \ldots, y_n) = \frac{5}{9}(\mathrm{Med}(x_1, \ldots, x_n) - 32)$?

c. Suppose we have a dataset $x_1, x_2, \ldots, x_n$ and construct $y_1, y_2, \ldots, y_n$
where $y_i = a x_i + b$ with a and b being real numbers. Do similar relations
hold for the sample mean and sample median? If so, state them.

16.5 Consider the uncorrected Wick temperature data in degrees Fahrenheit
($x_i$) and the corresponding temperatures in degrees Celsius ($y_i$) as given in
Exercise 16.4. The sample standard deviation and the MAD for the Wick data
are 6.62 and 1.

a. Let $s_F$ and $s_C$ denote the sample standard deviations of $x_1, x_2, \ldots, x_n$
and $y_1, y_2, \ldots, y_n$ respectively. Check that $s_C = \frac{5}{9} s_F$.

b. Let $\mathrm{MAD}_F$ and $\mathrm{MAD}_C$ denote the MAD of $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$
respectively. Is it also true that $\mathrm{MAD}_C = \frac{5}{9}\mathrm{MAD}_F$?

c. Suppose we have a dataset $x_1, x_2, \ldots, x_n$ and construct $y_1, y_2, \ldots, y_n$
where $y_i = a x_i + b$ with a and b being real numbers. Do similar relations
hold for the sample standard deviation and the MAD? If so, state
them.

16.6 ⊞ Consider two datasets: 1, 5, 9 and 2, 4, 6, 8.

a. Denote the sample means of the two datasets by $\bar{x}$ and $\bar{y}$. Is it true that the
average $(\bar{x} + \bar{y})/2$ of $\bar{x}$ and $\bar{y}$ is equal to the sample mean of the combined
dataset with 7 elements?

b. Suppose we have two other datasets: one of size n with sample mean
$\bar{x}_n$ and another dataset of size m with sample mean $\bar{y}_m$. Is it always
true that the average $(\bar{x}_n + \bar{y}_m)/2$ of $\bar{x}_n$ and $\bar{y}_m$ is equal to the sample
mean of the combined dataset with n + m elements? If no, then provide
a counterexample. If yes, then explain this.

c. If m = n, is $(\bar{x}_n + \bar{y}_m)/2$ equal to the sample mean of the combined dataset
with n + m elements?
16.7 Consider the two datasets from Exercise 16.6.

a. Denote the sample medians of the two datasets by $\mathrm{Med}_x$ and $\mathrm{Med}_y$. Is it
true that the sample median $(\mathrm{Med}_x + \mathrm{Med}_y)/2$ of the two sample medians
is equal to the sample median of the combined dataset with 7 elements?

b. Suppose we have two other datasets: one of size n with sample median
$\mathrm{Med}_x$ and another dataset of size m with sample median $\mathrm{Med}_y$. Is it
always true that the sample median $(\mathrm{Med}_x + \mathrm{Med}_y)/2$ of the two sample
medians is equal to the sample median of the combined dataset with n + m
elements? If no, then provide a counterexample. If yes, then explain this.

c. What if m = n?

16.8 ⊞ Compute the MAD for the combined dataset of 7 elements from
Exercise 16.6.

16.9 Consider a dataset $x_1, x_2, \ldots, x_n$ with $x_i \neq 0$. We construct a second
dataset $y_1, y_2, \ldots, y_n$, where

\[ y_i = \frac{1}{x_i}. \]

a. Suppose dataset $x_1, x_2, \ldots, x_n$ consists of −6, 1, 15. Is it true that
$\bar{y}_3 = 1/\bar{x}_3$?

b. Suppose that n is odd. Is it true that $\bar{y}_n = 1/\bar{x}_n$?

c. Suppose that n is odd and each $x_i > 0$. Is it true that $\mathrm{Med}(y_1, \ldots, y_n) =
1/\mathrm{Med}(x_1, \ldots, x_n)$? What about when n is even?

16.10 □ A method to investigate the sensitivity of the sample mean and the
sample median to extreme outliers is to replace one or more elements in a
given dataset by a number y and investigate the effect when y goes to infinity.
To illustrate this, consider the dataset from Quick exercise 16.1:

4.6 3.0 3.2 4.2 5.0

with sample mean 4 and sample median 4.2.

a. We replace the element 3.2 by some real number y. What happens with
the sample mean and the sample median of this new dataset as $y \to \infty$?

b. We replace a number of elements by some real number y. How many
elements do we need to replace so that the sample median of the new
dataset goes to infinity as $y \to \infty$?

c. Suppose we have another dataset of size n. How many elements do we
need to replace by some real number y, so that the sample mean of the
new dataset goes to infinity as $y \to \infty$? And how many elements do we
need to replace, so that the sample median of the new dataset goes to
infinity?
16.11 Just as in Exercise 16.10 we investigate the sensitivity of the sample
standard  deviation  and  the  MAD  to  extreme  outliers,  by  considering  the  same 
dataset  with  sample  standard  deviation  0.872  and  MAD  equal  to  0.8.  Answer 
the  same  three  questions  for  the  sample  standard  deviation  and  the  MAD 
instead  of  the  sample  mean  and  sample  median. 

16.12 □ Compute the sample mean and sample median for the dataset

\[ 1, 2, \ldots, N \]

in case N is odd and in case N is even. You may use the fact that

\[ 1 + 2 + \cdots + N = \frac{N(N+1)}{2}. \]

16.13 Compute the sample standard deviation and MAD for the dataset

\[ -N, \ldots, -1, 0, 1, \ldots, N. \]

You may use the fact that

\[ 1^2 + 2^2 + \cdots + N^2 = \frac{N(N+1)(2N+1)}{6}. \]

16.14  Check  that  the  50th  empirical  percentile  is  the  sample  median. 

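16.15 ⊞ The following rule is useful for the computation of the sample variance
(and standard deviation). Show that

\[ \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2\right) - (\bar{x}_n)^2, \]

where $\bar{x}_n = \left(\sum_{i=1}^{n} x_i\right)/n$.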

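16.16 Recall Exercise 15.12, where we computed the mean and second moment
corresponding to a density estimate $f_{n,h}$. Show that the variance corresponding
to $f_{n,h}$ satisfies:

\[ \int_{-\infty}^{\infty} t^2 f_{n,h}(t)\,dt - \left(\int_{-\infty}^{\infty} t f_{n,h}(t)\,dt\right)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du. \]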

16.17 Suppose we have a dataset $x_1, x_2, \ldots, x_n$. Check that if p = i/(n + 1)
the pth empirical quantile is the ith order statistic.
In this chapter we introduce a common statistical model. It corresponds to
the  situation  where  the  elements  of  the  dataset  are  repeated  measurements 
of  the  same  quantity  and  where  different  measurements  do  not  influence  each 
other.  Next,  we  discuss  the  probability  distribution  of  the  random  variables 
that  model  the  measurements  and  illustrate  how  sample  statistics  can  help 
to  select  a  suitable  statistical  model.  Finally,  we  discuss  the  simple  linear 
regression  model  that  corresponds  to  the  situation  where  the  elements  of  the 
dataset  are  paired  measurements. 


17.1  Random  samples  and  statistical  models 

In Chapter 1 we briefly discussed Michelson’s experiment conducted between
June  5  and  July  2  in  1879,  in  which  100  measurements  were  obtained  on  the 
speed  of  light.  The  values  are  given  in  Table  17.1  and  represent  the  speed 
of  light  in  air  in  km/sec  minus  299  000.  The  variation  among  the  100  values 
suggests  that  measuring  the  speed  of  light  is  subject  to  random  influences.  As 
we  have  seen  before,  we  describe  random  phenomena  by  means  of  a  probability 
model,  i.e.,  we  interpret  the  outcome  of  an  experiment  as  a  realization  of 
some  random  variable.  Hence  the  first  measurement  is  modeled  by  a  random 
variable  Xi  and  the  value  850  is  interpreted  as  the  realization  of  Xi.  Similarly, 
the  second  measurement  is  modeled  by  a  random  variable  X2  and  the  value  740 
is  interpreted  as  the  realization  of  X2  ■  Since  both  measurements  are  obtained 
under  the  same  experimental  conditions,  it  is  justified  to  assume  that  the 
probability  distributions  of  Xi  and  X2  are  the  same.  More  generally,  the  100 
measurements  are  modeled  by  random  variables 

Xi,X2, . . . ,  Aioo 

with the same probability distribution, and the values in Table 17.1 are
interpreted as realizations of $X_1, X_2, \ldots, X_{100}$. Moreover, because we
believe that Michelson took great care not to have the measurements influence
each other, the random variables $X_1, X_2, \ldots, X_{100}$ are assumed to be
mutually independent (see also Remark 3.1 about physical and stochastic
independence). Such a collection of random variables is called a random sample
or briefly, sample.


Table 17.1. Michelson data on the speed of light.

 850   740   900  1070   930   850   950   980   980   880
1000   980   930   650   760   810  1000  1000   960   960
 960   940   960   940   880   800   850   880   900   840
 830   790   810   880   880   830   800   790   760   800
 880   880   880   860   720   720   620   860   970   950
 880   910   850   870   840   840   850   840   840   840
 890   810   810   820   800   770   760   740   750   760
 910   920   890   860   880   720   840   850   850   780
 890   840   780   810   760   810   790   810   820   850
 870   870   810   740   810   940   950   800   810   870

Source: E.N. Dorsey. The velocity of light. Transactions of the American
Philosophical Society, 34(1):1-110, 1944; Table 22 on pages 60-61.


Random sample. A random sample is a collection of random variables
$X_1, X_2, \ldots, X_n$ that have the same probability distribution
and are mutually independent.


If $F$ is the distribution function of each random variable $X_i$ in a random
sample, we speak of a random sample from $F$. Similarly, we speak of a random
sample from a density $f$, a random sample from an $N(\mu, \sigma^2)$ distribution, etc.

Quick exercise 17.1 Suppose we have a random sample $X_1, X_2$ from a
distribution with variance 1. Compute the variance of $X_1 + X_2$.

Properties  that  are  inherent  to  the  random  phenomenon  under  study  may 
provide  additional  knowledge  about  the  distribution  of  the  sample.  Recall 
the  software  data  discussed  in  Chapter  15.  The  data  are  observed  lengths  in 
CPU  seconds  between  successive  failures  that  occur  during  the  execution  of 
a  certain  real-time  command.  Typically,  in  a  situation  like  this,  in  a  small 
time  interval,  either  0  or  1  failure  occurs.  Moreover,  failures  occur  with  small 
probability  and  in  disjoint  time  intervals  failures  occur  independent  of  each 
other.  In  addition,  let  us  assume  that  the  rate  at  which  the  failures  occur 
is  constant  over  time.  According  to  Chapter  12,  this  justifies  the  choice  of 
a  Poisson  process  to  model  the  series  of  failures.  From  the  properties  of  the 
Poisson  process  we  know  that  the  interfailure  times  are  independent  and  have 
the  same  exponential  distribution.  Hence  we  model  the  software  data  as  the 
realization  of  a  random  sample  from  an  exponential  distribution. 
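
The reasoning above can be illustrated with a small simulation. The sketch
below is only an approximation of a Poisson process under stated assumptions:
time is cut into many small intervals, and in each interval a failure occurs
with a small probability, independently of the other intervals. The rate, the
interval width, and the seed are hypothetical values chosen for illustration.
If the argument is right, the simulated interfailure times should behave like
draws from an exponential distribution with the same rate.

```python
import math
import random

rng = random.Random(12)        # fixed seed, for reproducibility only
rate = 2.0                     # hypothetical failure rate per unit of time
dt = 0.01                      # width of one small time interval
n_intervals = 200_000

# In each small interval either 0 or 1 failure occurs; a failure has
# probability rate * dt, independently of all other intervals.
failure_times = [i * dt for i in range(n_intervals) if rng.random() < rate * dt]
gaps = [later - earlier for earlier, later in zip(failure_times, failure_times[1:])]

mean_gap = sum(gaps) / len(gaps)
frac_long = sum(gap > 1.0 for gap in gaps) / len(gaps)

print(f"mean interfailure time: {mean_gap:.3f}   1/rate    = {1 / rate:.3f}")
print(f"fraction of gaps > 1:   {frac_long:.3f}   exp(-rate) = {math.exp(-rate):.3f}")
```

For an exponential distribution with parameter `rate` the expectation is
`1/rate` and the probability of exceeding 1 is `exp(-rate)`, which is what the
two printed comparisons refer to.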
In some cases we may not be able to specify the type of distribution. Take, for
instance, the Old Faithful data consisting of observed durations of eruptions
of the Old Faithful geyser. Due to lack of specific geological knowledge about
the subsurface and the mechanism that governs the eruptions, we prefer not to
assume a particular type of distribution. However, we do model the durations
as the realization of a random sample from a continuous distribution on
$(0, \infty)$.

In each of the three examples the dataset was obtained from repeated
measurements performed under the same experimental conditions. The basic
statistical model for such a dataset is to consider the measurements as a
random sample and to interpret the dataset as the realization of the random
sample. Knowledge about the phenomenon under study and the nature of the
experiment may lead to partial specification of the probability distribution
of each $X_i$ in the sample. This should be included in the model.


Statistical model for repeated measurements. A dataset
consisting of values $x_1, x_2, \ldots, x_n$ of repeated measurements of the
same quantity is modeled as the realization of a random sample
$X_1, X_2, \ldots, X_n$. The model may include a partial specification of
the probability distribution of each $X_i$.


The probability distribution of each $X_i$ is called the model distribution.
Usually it refers to a collection of distributions: in the Old Faithful example
to the collection of all continuous distributions on $(0, \infty)$, in the
software example to the collection of all exponential distributions. In the
latter case the parameter of the exponential distribution is called the model
parameter. The unique distribution from which the sample actually originates
is assumed to be one particular member of this collection and is called the
“true” distribution. Similarly, in the software example, the parameter
corresponding to the “true” exponential distribution is called the “true”
parameter. The word true is put between quotation marks because it does not
refer to something in the real world, but only to a distribution (or
parameter) in the statistical model, which is merely an approximation of the
real situation.

Quick exercise 17.2 We obtain a dataset of ten elements by tossing a coin
ten times and recording the result of each toss. What is an appropriate
statistical model and corresponding model distribution for this dataset?

Of course there are situations where the assumption of independence or
identical distributions is unrealistic. In that case a different statistical
model would be more appropriate. However, we will restrict ourselves mainly
to the case where the dataset can be modeled as the realization of a random
sample.

Once we have formulated a statistical model for our dataset, we can use the
dataset to infer knowledge about the model distribution. Important questions
about the corresponding model distribution are

•  which feature of the model distribution represents the quantity of interest
   and how do we use our dataset to determine a value for this?

•  which model distribution fits a particular dataset best?

These questions can be diverse, and answering them may be difficult. For
instance, the Old Faithful data are modeled as a realization of a random
sample from a continuous distribution. Suppose we are interested in a complete
characterization of the “true” distribution, such as the distribution function
$F$ or the probability density $f$. Since there are no further specifications
about the type of distribution, our problem would be to estimate the complete
curve of $F$ or $f$ on the basis of our dataset.

On the other hand, the software data are modeled as the realization of a
random sample from an exponential distribution. In that case $F$ and $f$ are
completely characterized by a single parameter $\lambda$:

\[ F(x) = 1 - e^{-\lambda x} \quad\text{and}\quad f(x) = \lambda e^{-\lambda x} \quad\text{for } x \ge 0. \]

Even if we are interested in the curves of $F$ and $f$, our problem would reduce
to estimating a single parameter on the basis of our dataset.

In  other  cases  we  may  not  be  interested  in  the  distribution  as  a  whole,  but 
only  in  a  specific  feature  of  the  model  distribution  that  represents  the  quantity 
of  interest.  For  instance,  in  a  physical  experiment,  such  as  the  one  performed 
by  Michelson,  one  usually  thinks  of  each  measurement  as 

measurement  =  quantity  of  interest  +  measurement  error. 

The quantity of interest, in this case the speed of light, is thought of as being
some (unknown) constant and the measurement error is some random
fluctuation. In the absence of systematic error, the measurement error can be
modeled by a random variable with zero expectation and finite variance. In
that case the measurements are modeled by a random sample from a
distribution with some unknown expectation and finite variance. The speed of
light is represented by the expectation of the model distribution. Our problem
would be to estimate the expectation of the model distribution on the basis of
our dataset.

In  the  remaining  chapters,  we  will  develop  several  statistical  methods  to  infer 
knowledge  about  the  “true”  distribution  or  about  a  specific  feature  of  it,  by 
means  of  a  dataset.  In  the  remainder  of  this  chapter  we  will  investigate  how 
the  graphical  and  numerical  summaries  of  our  dataset  can  serve  as  a  first 
indication  of  what  an  appropriate  choice  would  be  for  this  distribution  or  for 
a  specific  feature,  such  as  its  expectation. 


17.2  Distribution  features  and  sample  statistics 

In Chapters 15 and 16 we have discussed several empirical summaries of
datasets. They are examples of numbers, curves, and other objects that are a
function

\[ h(x_1, x_2, \ldots, x_n) \]

of the dataset $x_1, x_2, \ldots, x_n$ only. Since datasets are modeled as
realizations of random samples $X_1, X_2, \ldots, X_n$, an object
$h(x_1, x_2, \ldots, x_n)$ is a realization of the corresponding random object

\[ h(X_1, X_2, \ldots, X_n). \]

Such an object, which depends on the random sample $X_1, X_2, \ldots, X_n$ only, is
called a sample statistic.

If a statistical model adequately describes the dataset at hand, then the sample
statistics corresponding to the empirical summaries should somehow reflect
corresponding features of the model distribution. We have already seen a
mathematical justification for this in Chapter 13 for the sample statistic

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}, \]

based on a sample $X_1, X_2, \ldots, X_n$ from a probability distribution with
expectation $\mu$. According to the law of large numbers,

\[ \lim_{n\to\infty} \mathrm{P}\bigl(|\bar{X}_n - \mu| > \varepsilon\bigr) = 0 \]

for every $\varepsilon > 0$. This means that for large sample size $n$, the sample mean
of most realizations of the random sample is close to the expectation of the
corresponding distribution. In fact, all sample statistics discussed in
Chapters 15 and 16 are close to corresponding distribution features. To
illustrate this we generate an artificial dataset from a normal distribution
with parameters $\mu = 5$ and $\sigma = 2$, using a technique similar to the one
described in Section 6.2. Next, we compare the sample statistics with
corresponding features of this distribution.
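
A minimal sketch of such an experiment is given below. It does not reproduce
the dataset used in this chapter: the random number generator, the seed, and
the sample sizes are our own choices, and `random.gauss` is used in place of
the technique of Section 6.2. It only illustrates how the sample mean settles
near $\mu = 5$ as $n$ grows.

```python
import random
import statistics

rng = random.Random(17)            # fixed seed, for reproducibility only
mu, sigma = 5.0, 2.0               # parameters of the N(5, 4) distribution

def sample_mean(n):
    """Sample mean of n independent draws from N(mu, sigma^2)."""
    return statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))

sample_means = {n: sample_mean(n) for n in (20, 200, 20_000)}
for n, xbar in sample_means.items():
    print(f"n = {n:6d}   sample mean = {xbar:.3f}   |error| = {abs(xbar - mu):.3f}")
```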


The  empirical  distribution  function 

Let $X_1, X_2, \ldots, X_n$ be a random sample from distribution function $F$, and let

\[ F_n(a) = \frac{\text{number of } X_i \text{ in } (-\infty, a]}{n} \]

be the empirical distribution function of the sample. Another application of
the law of large numbers (see Exercise 13.7) yields that for every $\varepsilon > 0$,

\[ \lim_{n\to\infty} \mathrm{P}\bigl(|F_n(a) - F(a)| > \varepsilon\bigr) = 0. \]

This means that for most realizations of the random sample the empirical
distribution function is close to $F$:

\[ F_n(a) \approx F(a). \]

Fig. 17.1. Empirical distribution functions of normal samples.


Hence the empirical distribution function of the normal dataset should
resemble the distribution function

\[ F(a) = \int_{-\infty}^{a} \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-5}{2}\right)^2} dx \]

of the $N(5,4)$ distribution, and the fit should become better as the sample size
$n$ increases. An illustration of this can be found in Figure 17.1. We displayed
the empirical distribution functions of datasets generated from an $N(5,4)$
distribution together with the “true” distribution function $F$ (dotted lines),
for sample sizes $n = 20$ (left) and $n = 200$ (right).
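
The comparison between $F_n$ and $F$ can also be carried out numerically. The
following sketch is an illustration under our own assumptions (seed, grid of
points $a$, and sample sizes are hypothetical choices); it computes the largest
difference between the empirical distribution function of a simulated
$N(5,4)$ sample and the distribution function $F$ over a grid.

```python
import random
from bisect import bisect_right
from statistics import NormalDist

rng = random.Random(171)                   # fixed seed, for reproducibility only
model = NormalDist(mu=5, sigma=2)          # the N(5, 4) distribution

def ecdf(sample):
    """Return F_n, where F_n(a) = (number of x_i in (-inf, a]) / n."""
    xs = sorted(sample)
    return lambda a: bisect_right(xs, a) / len(xs)

grid = [i / 10 for i in range(-20, 121)]   # points a from -2 to 12

def max_gap(n):
    """Largest |F_n(a) - F(a)| over the grid, for a simulated sample of size n."""
    f_n = ecdf([rng.gauss(5, 2) for _ in range(n)])
    return max(abs(f_n(a) - model.cdf(a)) for a in grid)

gaps = {n: max_gap(n) for n in (20, 200, 20_000)}
for n, gap in gaps.items():
    print(f"n = {n:6d}   largest gap on the grid = {gap:.3f}")
```

The gap is typically much smaller for the larger samples, in line with the
better fit in the right panel of Figure 17.1.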


The  histogram  and  the  kernel  density  estimate 


Suppose the random sample $X_1, X_2, \ldots, X_n$ is generated from a continuous
distribution with probability density $f$. In Section 13.4 we have seen yet
another consequence of the law of large numbers:

\[ \frac{\text{number of } X_i \text{ in } (x-h, x+h]}{2hn} \approx f(x). \]

When $(x-h, x+h]$ is a bin of a histogram of the random sample, this means
that the height of the histogram approximates the value of $f$ at the midpoint
of the bin:

\[ \text{height of the histogram on } (x-h, x+h] \approx f(x). \]

Similarly, the kernel density estimate of a random sample approximates the
corresponding probability density $f$:

\[ f_{n,h}(x) \approx f(x). \]

Fig. 17.2. Histogram and kernel density estimate of a sample of size 200.


So the histogram and kernel density estimate of the normal dataset should
resemble the graph of the probability density

\[ f(x) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-5}{2}\right)^2} \]

of the $N(5,4)$ distribution. This is illustrated in Figure 17.2, where we
displayed a histogram and a kernel density estimate of our dataset consisting of
200 values generated from the $N(5,4)$ distribution. It should be noted that
with a smaller dataset the similarity can be much worse. This is demonstrated
in Figure 17.3, which is based on the dataset consisting of 20 values generated
from the same distribution.
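
Both approximations can be tried out at a single point $x$. The sketch below
is an illustration under our own assumptions: the sample is simulated, the
point $x = 5$, the half-width and bandwidth $h = 0.5$, and the standard normal
kernel are hypothetical choices, and the kernel density estimate is taken to
be $f_{n,h}(x) = \frac{1}{nh}\sum_i K\bigl((x - x_i)/h\bigr)$ as in Chapter 15.

```python
import random
from statistics import NormalDist

rng = random.Random(172)                   # fixed seed, for reproducibility only
model = NormalDist(mu=5, sigma=2)          # the N(5, 4) distribution
sample = [rng.gauss(5, 2) for _ in range(5_000)]

def histogram_height(sample, x, h):
    """(number of x_i in (x - h, x + h]) / (2 h n)."""
    count = sum(x - h < xi <= x + h for xi in sample)
    return count / (2 * h * len(sample))

def kernel_density(sample, x, h):
    """f_{n,h}(x) = (1 / (n h)) * sum of K((x - x_i) / h), standard normal K."""
    kernel = NormalDist().pdf
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

x, h = 5.0, 0.5
print(f"histogram height on (x-h, x+h]: {histogram_height(sample, x, h):.4f}")
print(f"kernel density estimate at x:   {kernel_density(sample, x, h):.4f}")
print(f"density f(x) of N(5, 4):        {model.pdf(x):.4f}")
```

With a fixed $h$ both values stay slightly off $f(x)$ even for large $n$; this is
the point taken up in Remark 17.1.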


Fig. 17.3. Histogram and kernel density estimate of a sample of size 20.


Remark 17.1 (About the approximations). Let $H_n$ be the height of
the histogram on the interval $(x-h, x+h]$, which is assumed to be a bin of
the histogram. Direct application of the law of large numbers merely yields
that $H_n$ converges to

\[ \frac{1}{2h}\int_{x-h}^{x+h} f(u)\,du. \]

Only for small $h$ is this close to $f(x)$. However, if we let $h$ tend to 0 as $n$
increases, a variation on the law of large numbers will guarantee that $H_n$
converges to $f(x)$: for every $\varepsilon > 0$,

\[ \lim_{n\to\infty} \mathrm{P}\bigl(|H_n - f(x)| > \varepsilon\bigr) = 0. \]

A possible choice is the optimal bin width mentioned in Remark 15.1.
Similarly, direct application of the law of large numbers yields that a kernel
density estimator with fixed bandwidth $h$ converges to

\[ \int_{-\infty}^{\infty} f(x + hu)K(u)\,du. \]

Once more, only for small $h$ is this close to $f(x)$, provided that $K$ is
symmetric and integrates to one. However, by letting the bandwidth $h$ tend
to 0 as $n$ increases, yet another variation on the law of large numbers will
guarantee that $f_{n,h}(x)$ converges to $f(x)$: for every $\varepsilon > 0$,

\[ \lim_{n\to\infty} \mathrm{P}\bigl(|f_{n,h}(x) - f(x)| > \varepsilon\bigr) = 0. \]

A possible choice is the optimal bandwidth mentioned in Remark 15.2.


The  sample  mean,  the  sample  median,  and  empirical  quantiles 

As we saw in Section 5.5, the expectation of an $N(\mu, \sigma^2)$ distribution is $\mu$;
so the $N(5,4)$ distribution has expectation 5. According to the law of large
numbers: $\bar{X}_n \approx \mu$. This is illustrated by our dataset of 200 values generated
from the $N(5,4)$ distribution for which we find

\[ \bar{x}_{200} = 5.012. \]

For the sample median we find

\[ \mathrm{Med}(x_1, \ldots, x_{200}) = 5.018. \]

This illustrates the fact that the sample median of a random sample from
$F$ approximates the median $q_{0.5} = F^{\mathrm{inv}}(0.5)$. In fact, we have the following
general property for the $p$th empirical quantile:

\[ q_n(p) \approx F^{\mathrm{inv}}(p) = q_p. \]

In the special case of the $N(\mu, \sigma^2)$ distribution, the expectation and the
median coincide, which explains why the sample mean and sample median of the
normal dataset are so close to each other.
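
The same comparison can be repeated on a simulated sample. The sketch below
is an illustration only: the sample is not the dataset of this chapter, the
seed and sample size are our own choices, and the empirical quantile is
computed with `statistics.quantiles`, whose default "exclusive" method
interpolates between order statistics at position $p(n+1)$ and is assumed here
to agree with the definition of $q_n(p)$ in Chapter 16.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(173)                   # fixed seed, for reproducibility only
model = NormalDist(mu=5, sigma=2)          # the N(5, 4) distribution
sample = [rng.gauss(5, 2) for _ in range(20_000)]

xbar = statistics.fmean(sample)            # sample mean
med = statistics.median(sample)            # sample median
# With n=10 the nine deciles are returned; index 8 is the 0.9th quantile.
q90 = statistics.quantiles(sample, n=10)[8]

print(f"sample mean     {xbar:.3f}   expectation    {model.mean:.3f}")
print(f"sample median   {med:.3f}   F^inv(0.5)     {model.inv_cdf(0.5):.3f}")
print(f"q_n(0.9)        {q90:.3f}   F^inv(0.9)     {model.inv_cdf(0.9):.3f}")
```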

The sample variance and standard deviation, and the MAD

As we saw in Section 5.5, the standard deviation and variance of an $N(\mu, \sigma^2)$
distribution are $\sigma$ and $\sigma^2$; so for the $N(5,4)$ distribution these are 2 and 4.
Another consequence of the law of large numbers is that

\[ S_n^2 \approx \sigma^2 \quad\text{and}\quad S_n \approx \sigma. \]

This is illustrated by our normal dataset of size 200, for which we find

\[ s_{200}^2 = 4.761 \quad\text{and}\quad s_{200} = 2.182 \]

for the sample variance and sample standard deviation.

For the MAD of the dataset we find 1.334, which clearly differs from the
standard deviation 2 of the $N(5,4)$ distribution. The reason is that

\[ \mathrm{MAD}(X_1, X_2, \ldots, X_n) \approx F^{\mathrm{inv}}(0.75) - F^{\mathrm{inv}}(0.5), \]

for any distribution that is symmetric around its median $F^{\mathrm{inv}}(0.5)$. For the
$N(5,4)$ distribution $F^{\mathrm{inv}}(0.75) - F^{\mathrm{inv}}(0.5) = 2\Phi^{\mathrm{inv}}(0.75) = 1.3490$, where
$\Phi$ denotes the distribution function of the standard normal distribution (see
Exercise 17.10).
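
The value 1.3490 and the behavior of the three measures of spread can be
checked with a short computation. The sketch below uses a simulated sample
(seed and sample size are our own choices, not the dataset of this chapter)
and a small helper `mad`, a name introduced here for illustration, that takes
the median of the absolute deviations from the sample median.

```python
import random
import statistics
from statistics import NormalDist

def mad(data):
    """Median of the absolute deviations from the sample median."""
    data = list(data)
    center = statistics.median(data)
    return statistics.median(abs(x - center) for x in data)

rng = random.Random(174)                   # fixed seed, for reproducibility only
model = NormalDist(mu=5, sigma=2)          # the N(5, 4) distribution
sample = [rng.gauss(5, 2) for _ in range(20_000)]

s2 = statistics.variance(sample)           # sample variance (divides by n - 1)
s = statistics.stdev(sample)               # sample standard deviation
mad_value = mad(sample)
mad_limit = model.inv_cdf(0.75) - model.inv_cdf(0.5)

print(f"sample variance {s2:.3f}, sample standard deviation {s:.3f}")
print(f"MAD {mad_value:.4f}, F^inv(0.75) - F^inv(0.5) = {mad_limit:.4f}")
print(f"2 * Phi^inv(0.75) = {2 * NormalDist().inv_cdf(0.75):.4f}")
```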

Relative  frequencies 

For continuous distributions the histogram and kernel density estimates of a
random sample approximate the corresponding probability density $f$. For
discrete distributions we would like to have a sample statistic that
approximates the probability mass function. In Section 13.4 we saw that, as a
consequence of the law of large numbers, relative frequencies based on a random
sample approximate corresponding probabilities. As a special case, for a random
sample $X_1, X_2, \ldots, X_n$ from a discrete distribution with probability mass
function $p$, one has that

\[ \frac{\text{number of } X_i \text{ equal to } a}{n} \approx p(a). \]

This means that the relative frequency of $a$’s in the sample approximates
the value of the probability mass function at $a$. Table 17.2 lists the sample
statistics and the corresponding distribution features they approximate.
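
As a small illustration of relative frequencies, the sketch below simulates
rolls of a fair die, for which $p(a) = 1/6$ for $a = 1, \ldots, 6$. The die, the
seed, and the number of rolls are our own hypothetical choices; any discrete
distribution with a known probability mass function would serve equally well.

```python
import random
from collections import Counter

rng = random.Random(175)                   # fixed seed, for reproducibility only
n = 60_000
rolls = [rng.randint(1, 6) for _ in range(n)]   # sample from a fair die

counts = Counter(rolls)
relative_frequency = {a: counts[a] / n for a in range(1, 7)}

for a, freq in relative_frequency.items():
    print(f"a = {a}: relative frequency = {freq:.4f}, p(a) = {1 / 6:.4f}")
```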


17.3  Estimating  features  of  the  “true”  distribution 

In the previous section we generated a dataset of 200 elements from a
probability distribution, and we have seen that certain features of this
distribution are approximated by corresponding sample statistics. In practice,
the situation is reversed. In that case we have a dataset of $n$ elements that
is modeled as the realization of a random sample with a probability
distribution that is unknown to us. Our goal is to use our dataset to estimate
a certain feature of this distribution that represents the quantity of
interest. In this section we will discuss a few examples.

Table 17.2. Some sample statistics and corresponding distribution features.

Sample statistic                                   Distribution feature

Graphical
  Empirical distribution function $F_n$            Distribution function $F$
  Kernel density estimate $f_{n,h}$ and histogram  Probability density $f$
  (Number of $X_i$ equal to $a$)$/n$               Probability mass function $p(a)$

Numerical
  Sample mean $\bar{X}_n$                          Expectation $\mu$
  Sample median $\mathrm{Med}(X_1, X_2, \ldots, X_n)$   Median $q_{0.5} = F^{\mathrm{inv}}(0.5)$
  $p$th empirical quantile $q_n(p)$                $100p$th percentile $q_p = F^{\mathrm{inv}}(p)$
  Sample variance $S_n^2$                          Variance $\sigma^2$
  Sample standard deviation $S_n$                  Standard deviation $\sigma$
  $\mathrm{MAD}(X_1, X_2, \ldots, X_n)$                 $F^{\mathrm{inv}}(0.75) - F^{\mathrm{inv}}(0.5)$, for symmetric $F$

The  Old  Faithful  data 

We stick to the assumptions of Section 17.1: for lack of knowledge about this
phenomenon we prefer not to specify a particular parametric type of
distribution, and we model the Old Faithful data as the realization of a random
sample of size 272 from a continuous probability distribution. From the
previous section we know that the kernel density estimate and the empirical
distribution function of the dataset approximate the probability density $f$
and the distribution function $F$ of this distribution. In Figure 17.4 a kernel
density estimate (left) and the empirical distribution function (right) are
displayed. Indeed, neither graph resembles the probability density function or
distribution function of any of the familiar parametric distributions. Instead
of viewing both graphs only as graphical summaries of the data, we can also
use both curves as estimates for $f$ and $F$. We estimate the model probability
density $f$ by means of the kernel density estimate and the model distribution
function $F$ by means of the empirical distribution function. Since neither
estimate assumes a particular parametric model, they are called nonparametric
estimates.

Fig. 17.4. Nonparametric estimates for $f$ and $F$ based on the Old Faithful data.

The  software  data 

Next consider the software reliability data. As motivated in Section 17.1,
we model interfailure times as the realization of a random sample from an
exponential distribution. To see whether an exponential distribution is indeed
a reasonable model, we plot a histogram and a kernel density estimate using
a boundary kernel in Figure 17.5.

Fig. 17.5. Histogram and kernel density estimate for the software data.

Both seem to corroborate the assumption of an exponential distribution.
Accepting this, we are left with estimating the parameter $\lambda$. Because for the
exponential distribution $\mathrm{E}[X] = 1/\lambda$, the law of large numbers suggests
$1/\bar{x}_n$ as an estimate for $\lambda$. For our dataset $\bar{x}_n = 656.88$, which yields
$1/\bar{x}_n = 0.0015$. In Figure 17.6 we compare the estimated exponential density
(left) and distribution function (right) with the corresponding nonparametric
estimates. Note that the nonparametric estimates do not assume an exponential
model for the data. But, if an exponential distribution were the right model,
the kernel density estimate and empirical distribution function should resemble
the estimated exponential density and distribution function. At first sight the
fit seems reasonable, although near zero the data accumulate more than one
might perhaps expect for a sample of size 135 from an exponential
distribution, and the other way around at the other end of the data range. The
question is whether this phenomenon can be attributed to chance or is caused
by the fact that the exponential model is the wrong model. We will return to
this type of question in Chapter 25 (see also Chapter 18).

Fig. 17.6. Kernel density estimate and empirical cdf for software data (solid)
compared to $f$ and $F$ of the estimated exponential distribution.
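
The arithmetic behind the estimate of the parameter, and the estimated curves
that are compared with the nonparametric estimates, can be written down
directly. The sketch below uses only the sample mean 656.88 reported in the
text, not the software data themselves; the names `lam_hat`, `F_hat`, and
`f_hat` are ours.

```python
import math

xbar = 656.88                 # sample mean of the software data, as reported above
lam_hat = 1 / xbar            # estimate of the parameter of the exponential model

def F_hat(x):
    """Estimated exponential distribution function, for x >= 0."""
    return 1 - math.exp(-lam_hat * x)

def f_hat(x):
    """Estimated exponential density, for x >= 0."""
    return lam_hat * math.exp(-lam_hat * x)

print(f"estimated parameter: {lam_hat:.4f}")
for x in (0, 500, 1000, 2000):
    print(f"x = {x:5d}   F_hat(x) = {F_hat(x):.3f}   f_hat(x) = {f_hat(x):.6f}")
```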


Michelson  data 

Consider the Michelson data on the speed of light. In this case we are not
particularly interested in estimation of the “true” distribution, but solely in
the expectation of this distribution, which represents the speed of light. The
law of large numbers suggests estimating the expectation by the sample
mean $\bar{x}_n$, which equals 852.4.
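
The sample mean can be recomputed from the values in Table 17.1. The sketch
below simply transcribes the table (values are the speed of light in air in
km/sec minus 299 000) and averages them; adding 299 000 back gives the
corresponding estimate on the original scale.

```python
import statistics

# Table 17.1: speed of light in air in km/sec minus 299 000.
michelson = [
    850, 740, 900, 1070, 930, 850, 950, 980, 980, 880,
    1000, 980, 930, 650, 760, 810, 1000, 1000, 960, 960,
    960, 940, 960, 940, 880, 800, 850, 880, 900, 840,
    830, 790, 810, 880, 880, 830, 800, 790, 760, 800,
    880, 880, 880, 860, 720, 720, 620, 860, 970, 950,
    880, 910, 850, 870, 840, 840, 850, 840, 840, 840,
    890, 810, 810, 820, 800, 770, 760, 740, 750, 760,
    910, 920, 890, 860, 880, 720, 840, 850, 850, 780,
    890, 840, 780, 810, 760, 810, 790, 810, 820, 850,
    870, 870, 810, 740, 810, 940, 950, 800, 810, 870,
]

xbar = statistics.mean(michelson)
print(f"number of measurements: {len(michelson)}")
print(f"sample mean: {xbar:.1f}")                       # 852.4
print(f"estimated speed of light in air: {299_000 + xbar:.1f} km/sec")
```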


17.4  The  linear  regression  model 

Recall the example about predicting Janka hardness of wood from the density
of the wood in Section 15.5. The idea is, of course, that Janka hardness is
related to the density: the higher the density of the wood, the higher the
value of Janka hardness. This suggests a relationship of the type

\[ \text{hardness} = g(\text{density of timber}) \]

for some increasing function $g$. This is supported by the scatterplot of the data
in Figure 17.7. A closer look at the bivariate dataset in Table 15.5 suggests
that randomness is also involved. For instance, for the value 51.5 of the density,
different corresponding values of Janka hardness were observed. One way to
model such a situation is by means of a regression model:

\[ \text{hardness} = g(\text{density of timber}) + \text{random fluctuation}. \]

The important question now is what sort of function $g$ fits well to the points
in the scatterplot?

Fig. 17.7. Scatterplot of Janka hardness versus wood density.

In general, this may be a difficult question to answer. We may have so little
knowledge about the phenomenon under study, and the data points may be
scattered in such a way, that there is no reason to assume a specific type of
function for $g$. However, for the Janka hardness data it makes sense to assume
that $g$ is increasing, but this still leaves us with many possibilities. Looking at
the scatterplot, at first sight it does not seem unreasonable to assume that $g$ is
a straight line, i.e., Janka hardness depends linearly on the density of timber.
The fact that the points are not exactly on a straight line is then modeled by
a random fluctuation with respect to the straight line:

\[ \text{hardness} = \alpha + \beta \cdot (\text{density of timber}) + \text{random fluctuation}. \]

This is a loose description of a simple linear regression model. A more complete
description is given below.


Simple linear regression model. In a simple linear regression
model for a bivariate dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we
assume that $x_1, x_2, \ldots, x_n$ are nonrandom and that $y_1, y_2, \ldots, y_n$ are
realizations of random variables $Y_1, Y_2, \ldots, Y_n$ satisfying

\[ Y_i = \alpha + \beta x_i + U_i \quad\text{for } i = 1, 2, \ldots, n, \]

where $U_1, \ldots, U_n$ are independent random variables with $\mathrm{E}[U_i] = 0$
and $\mathrm{Var}(U_i) = \sigma^2$.


The line $y = \alpha + \beta x$ is called the regression line. The parameters $\alpha$ and $\beta$
represent the intercept and slope of the regression line. Usually, the $x$-variable
is called the explanatory variable and the $y$-variable is called the response
variable. One also refers to $x$ and $y$ as independent and dependent variables.
The random variables $U_1, U_2, \ldots, U_n$ are assumed to be independent when the
different measurements do not influence each other. They are assumed to have
expectation zero, because the random fluctuation is considered to be around
the regression line $y = \alpha + \beta x$. Finally, because each random fluctuation
is supposed to have the same amount of variability, we assume that all $U_i$
have the same variance. Note that by the propagation of independence rule
in Section 9.4, independence of the $U_i$ implies independence of the $Y_i$. However,
$Y_1, Y_2, \ldots, Y_n$ do not form a random sample. Indeed, the $Y_i$ have different
distributions because every $Y_i$ has a different expectation

\[ \mathrm{E}[Y_i] = \mathrm{E}[\alpha + \beta x_i + U_i] = \alpha + \beta x_i + \mathrm{E}[U_i] = \alpha + \beta x_i. \]
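
The computation of the expectation can be illustrated by simulating the model
many times for fixed values of the explanatory variable. Everything specific
in the sketch below is a hypothetical choice made for illustration: the
parameter values, the three $x$-values, the seed, and in particular the normal
distribution of the $U_i$ (the model itself only requires expectation zero and a
common variance).

```python
import random
import statistics

rng = random.Random(176)                   # fixed seed, for reproducibility only
alpha, beta, sigma = 1.0, 2.0, 0.5         # hypothetical parameter values
xs = [1.0, 2.0, 3.0]                       # nonrandom values of the explanatory variable

def simulate_responses():
    """One realization of Y_i = alpha + beta * x_i + U_i for every x_i."""
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

replications = [simulate_responses() for _ in range(20_000)]
mean_y = [statistics.fmean(ys[i] for ys in replications) for i in range(len(xs))]

for x, m in zip(xs, mean_y):
    print(f"x = {x}: average of Y = {m:.3f}, alpha + beta * x = {alpha + beta * x:.3f}")
```

The averages differ from one $x$-value to the next, which is the reason the
$Y_i$ do not form a random sample.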


Quick  exercise  17.3  Consider  the  simple  linear  regression  model  as  defined 
earlier.  Compute  the  variance  of  Yi. 

The parameters $\alpha$ and $\beta$ are unknown and our task will be to estimate them on
the basis of the data. We will come back to this in Chapter 22. In Figure 17.8
the scatterplot for the Janka hardness data is displayed with the estimated
regression line.

Fig. 17.8. Estimated regression line for the Janka hardness data.

One may wonder whether

\[ y = \alpha + \beta x + \gamma x^2 \]

would be a more appropriate model. By trying to answer this question we
enter the area of multiple linear regression. We will not pursue this topic; we
restrict ourselves to simple linear regression.


17.5 Solutions to the quick exercises

17.1 Because $X_1, X_2$ form a random sample, they are independent. Using
the rule about the variance of the sum of independent random variables, this
means that $\mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = 1 + 1 = 2$.
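
A quick simulation agrees with this answer. The sketch below takes the
standard normal distribution as one example of a distribution with variance 1;
that choice, the seed, and the number of simulated pairs are ours.

```python
import random
import statistics

rng = random.Random(177)                   # fixed seed, for reproducibility only
n_pairs = 50_000

# Each sum uses two independent draws from N(0, 1), which has variance 1.
sums = [rng.gauss(0, 1) + rng.gauss(0, 1) for _ in range(n_pairs)]
var_of_sum = statistics.variance(sums)

print(f"sample variance of X1 + X2 over {n_pairs} pairs: {var_of_sum:.3f}")
```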

17.2 The result of each toss of a coin can be modeled by a Bernoulli random
variable taking values 1 (heads) and 0 (tails). In the case when it is known
that we are tossing a fair coin, heads and tails occur with equal probability.
Since it is reasonable to assume that the tosses do not influence each other,
the outcomes of the ten tosses are modeled as the realization of a random
sample $X_1, \ldots, X_{10}$ from a Bernoulli distribution with parameter $p = 1/2$. In
this case the model distribution is completely specified and coincides with the
“true” distribution: a $\mathrm{Ber}(\frac{1}{2})$ distribution.

In the case when we are dealing with a possibly unfair coin, the outcomes
of the ten tosses are still modeled as the realization of a random sample
$X_1, \ldots, X_{10}$ from a Bernoulli distribution, but we cannot specify the value
of the parameter $p$. The model distribution is a Bernoulli distribution. The
“true” distribution is a Bernoulli distribution with one particular value for $p$,
unknown to us.

17.3 Note that the $x_i$ are considered nonrandom. By the rules for the
variance, we find $\mathrm{Var}(Y_i) = \mathrm{Var}(\alpha + \beta x_i + U_i) = \mathrm{Var}(U_i) = \sigma^2$.


17.6  Exercises 

17.1  □  Figure  17.9  displays  several  histograms,  kernel  density  estimates,  and 
empirical  distribution  functions.  It  is  known  that  all  figures  correspond  to 
datasets  of  size  200  that  are  generated  from  normal  distributions  IV(0,1), 
N{0, 9),  and  iV(3, 1),  and  from  exponential  distributions  Exp{l)  and  Exp{l/3). 
Report  for  each  figure  from  which  distribution  the  dataset  has  been  generated. 

17.2  □  Figure  17.10  displays  several  boxplots.  It  is  known  that  all  figures 
correspond to datasets of size 200 that are generated from the same five
distributions as in Exercise 17.1. Report for each boxplot from which distribution
the  dataset  has  been  generated. 

17.3 ⊞ At a London underground station, the number of women was counted
in each of 100 queues of length 10. In this way a dataset $x_1, x_2, \ldots, x_{100}$ was
obtained, where $x_i$ denotes the observed number of women in the $i$th queue.
The  dataset  is  summarized  in  the  following  table  and  lists  the  number  of 
queues  with  0  women,  1  woman,  2  women,  etc. 

Fig. 17.9. Graphical representations of different datasets from Exercise 17.1.

Fig. 17.10. Boxplot of different datasets from Exercise 17.2.

Count      0  1  2   3   4   5   6  7  8  9  10
Frequency  1  3  4  23  25  19  18  5  1  1   0


Source: R.A. Jinkinson and M. Slater. Critical discussion of a graphical
method for identifying discrete distributions. The Statistician, 30:239-248,
1981; Table 1 on page 240.

In the statistical model for this dataset, we assume that the observed counts
are a realization of a random sample $X_1, X_2, \ldots, X_{100}$.

a.  Assume  that  people  line  up  in  such  a  way  that  a  man  or  woman  in  a 
certain  position  is  independent  of  the  other  positions,  and  that  in  each 
position  one  has  a  woman  with  equal  probability.  What  is  an  appropriate 
choice  for  the  model  distribution? 

b. Use the table to find an estimate for the parameter(s) of the model
distribution chosen in part a.

17.4  During  the  Second  World  War,  London  was  hit  by  numerous  flying 
bombs.  The  following  data  are  from  an  area  in  South  London  of  36  square 
kilometers.  The  area  was  divided  into  576  squares  with  sides  of  length  1/4 
kilometer.  For  each  of  the  576  squares  the  number  of  hits  was  recorded.  In 
this way we obtain a dataset $x_1, x_2, \ldots, x_{576}$, where $x_i$ denotes the number of
hits  in  the  ith  square.  The  data  are  summarized  in  the  following  table  which 
lists  the  number  of  squares  with  no  hits,  1  hit,  2  hits,  etc. 


Number of hits       0    1   2   3  4  5  6  7
Number of squares  229  211  93  35  7  0  0  1


Source:  R.D.  Clarke.  An  application  of  the  Poisson  distribution.  Journal  of 
the  Institute  of  Actuaries,  72:48,  1946;  Table  1  on  page  481.  (c)  Faculty  and 
Institute  of  Actuaries. 

An  interesting  question  is  whether  London  was  hit  in  a  completely  random 
manner.  In  that  case  a  Poisson  distribution  should  fit  the  data. 

a. If we model the dataset as the realization of a random sample from a
Poisson distribution with parameter $\mu$, then what would you choose as an
estimate for $\mu$?

b. Check the fit with a Poisson distribution by comparing some of the observed
relative frequencies of 0's, 1's, 2's, etc., with the corresponding
probabilities for the Poisson distribution with $\mu$ estimated as in part a.

17.5  □  We  return  to  the  example  concerning  the  number  of  menstrual  cycles 
up  to  pregnancy,  where  the  number  of  cycles  was  modeled  by  a  geometric 
random  variable  (see  Section  4.4).  The  original  data  concerned  100  smoking 
and  486  nonsmoking  women.  For  7  smokers  and  12  nonsmokers,  the  exact 
number of cycles up to pregnancy was unknown. In the following tables we only
incorporated the 93 smokers and 474 nonsmokers, for which the exact number
of  cycles  was  observed.  Another  analysis,  based  on  the  complete  dataset,  is 
done  in  Section  21.1. 

a. Consider the dataset $x_1, x_2, \ldots, x_{93}$ corresponding to the smoking women,
where $x_i$ denotes the number of cycles for the $i$th smoking woman. The
data  are  summarized  in  the  following  table. 


Cycles  1  2  3  4  5  6  7  8  9  10  11  12 

Frequency  29  16  17  4  3  9  4  5  1  1  1  3 


Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution applied
to comparative fecundability studies. Biometrics, 42(3):547-560, 1986.

The  table  lists  the  number  of  women  that  had  to  wait  1  cycle,  2  cycles, 
etc.  If  we  model  the  dataset  as  the  realization  of  a  random  sample  from  a 
geometric  distribution  with  parameter  p,  then  what  would  you  choose  as 
an  estimate  for  p? 

b.  Also  estimate  the  parameter  p  for  the  474  nonsmoking  women,  which 
is  also  modeled  as  the  realization  of  a  random  sample  from  a  geometric 
distribution. The dataset $y_1, y_2, \ldots, y_{474}$, where $y_j$ denotes the number of
cycles for the $j$th nonsmoking woman, is summarized here:


Cycles  1  2  3  4  5  6  7  8  9  10  11  12 

Frequency  198  107  55  38  18  22  7  9  5  3  6  6 


Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution applied
to comparative fecundability studies. Biometrics, 42(3):547-560, 1986.

You may use that $y_1 + y_2 + \cdots + y_{474} = 1285$.

c.  Compare  the  estimates  of  the  probability  of  becoming  pregnant  in  three 
or  fewer  cycles  for  smoking  and  nonsmoking  women. 

17.6 Recall Exercise 15.1 about the chest circumference of 5732 Scottish soldiers,
where we constructed the histogram displayed in Figure 17.11. The
histogram  suggests  modeling  the  data  as  the  realization  of  a  random  sample 
from  a  normal  distribution. 

a. Suppose that for the dataset $\sum x_i = 228377.2$ and $\sum x_i^2 = 9124064$. What
would you choose as estimates for the parameters $\mu$ and $\sigma$ of the $N(\mu, \sigma^2)$
distribution?

Hint:  you  may  want  to  use  the  relation  from  Exercise  16.15. 

b.  Give  an  estimate  for  the  probability  that  a  Scottish  soldier  has  a  chest 
circumference  between  38.5  and  42.5  inches. 

Fig. 17.11. Histogram of chest circumferences.


17.7 ⊞ Recall Exercise 15.3 about time intervals between successive coal mine
disasters.  Let  us  assume  that  the  rate  at  which  the  disasters  occur  is  constant 
over  time  and  that  on  a  single  day  a  disaster  takes  place  with  small  probability 
independently  of  what  happens  on  other  days.  According  to  Chapter  12  this 
suggests  modeling  the  series  of  disasters  with  a  Poisson  process.  Figure  17.12 
displays  a  histogram  and  empirical  distribution  function  of  the  observed  time 
intervals. 

a.  In  the  statistical  model  for  this  dataset  we  model  the  190  time  intervals 
as  the  realization  of  a  random  sample.  What  would  you  choose  for  the 
model  distribution? 

b. The sum of the observed time intervals is 40 549 days. Give an estimate
for the parameter(s) of the distribution chosen in part a.

Fig. 17.12. Histogram of time intervals between successive disasters.

17.8 The following data represent the number of revolutions to failure (in
millions) of 22 deep-groove ball-bearings.


 17.88   28.92   33.00   41.52   42.12   45.60
 48.48   51.84   51.96   54.12   55.56   67.80
 68.64   68.88   84.12   93.12   98.64  105.12
105.84  127.92  128.04  173.40

Source: J. Lieblein and M. Zelen. Statistical investigation of the fatigue-life
of deep-groove ball-bearings. Journal of Research, National Bureau of
Standards, 57:273-316, 1956; specimen worksheet on page 286.

Lieblein  and  Zelen  propose  modeling  the  dataset  as  a  realization  of  a  random 
sample  from  a  Weibull  distribution,  which  has  distribution  function 

\[ F(x) = 1 - e^{-(\lambda x)^{\alpha}} \quad \text{for } x > 0, \]

and $F(x) = 0$ for $x < 0$, where $\alpha, \lambda > 0$.

a. Suppose that $X$ is a random variable with a Weibull distribution. Check
that the random variable $V = X^{\alpha}$ has an exponential distribution with
parameter $\lambda^{\alpha}$ and conclude that $\mathrm{E}[X^{\alpha}] = 1/\lambda^{\alpha}$.

b. Use part a to explain how one can use the data in the table to find
an estimate for the parameter $\lambda$, if it is given that the parameter $\alpha$ is
estimated by 2.102.

17.9 ⊞ The volume (i.e., the effective wood production in cubic meters),
height  (in  meters),  and  diameter  (in  meters)  (measured  at  1.37  meter  above 
the  ground)  are  recorded  for  31  black  cherry  trees  in  the  Allegheny  National 
Forest  in  Pennsylvania.  The  data  are  listed  in  Table  17.3.  They  were  collected 
to  find  an  estimate  for  the  volume  of  a  tree  (and  therefore  for  the  timber 
yield), given its height and diameter. For each tree the volume $y$ and the
value of $x = d^2 h$ are recorded, where $d$ and $h$ are the diameter and height
of the tree. The resulting points $(x_1, y_1), \ldots, (x_{31}, y_{31})$ are displayed in the
scatterplot in Figure 17.13.

We model the data by the following linear regression model (without intercept)

\[ Y_i = \beta x_i + U_i \quad \text{for } i = 1, 2, \ldots, 31. \]

a. What physical reasons justify the linear relationship between $y$ and $d^2 h$?
Hint: how does the volume of a cylinder relate to its diameter and height?

b. We want to find an estimate for the slope $\beta$ of the line $y = \beta x$. Two
natural candidates are the average slope $\bar{z}_n$, where $z_i = y_i/x_i$, and the
Table 17.3. Measurements on black cherry trees.

Diameter  Height  Volume
0.21      21.3    0.29
0.22      19.8    0.29
0.22      19.2    0.29
0.27      21.9    0.46
0.27      24.7    0.53
0.27      25.3    0.56
0.28      20.1    0.44
0.28      22.9    0.52
0.28      24.4    0.64
0.28      22.9    0.56
0.29      24.1    0.69
0.29      23.2    0.59
0.29      23.2    0.61
0.30      21.0    0.60
0.30      22.9    0.54
0.33      22.6    0.63
0.33      25.9    0.96
0.34      26.2    0.78
0.35      21.6    0.73
0.35      19.5    0.71
0.36      23.8    0.98
0.36      24.4    0.90
0.37      22.6    1.03
0.41      21.9    1.08
0.41      23.5    1.21
0.44      24.7    1.57
0.44      25.0    1.58
0.45      24.4    1.65
0.46      24.4    1.46
0.46      24.4    1.44
0.52      26.5    2.18

Source: A.C. Atkinson. Regression diagnostics, transformations and constructed
variables (with discussion). Journal of the Royal Statistical Society,
Series B, 44:1-36, 1982.


slope of the averages $\bar{y}_n/\bar{x}_n$. In Chapter 22 we will encounter the so-called
least squares estimate:

\[ \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. \]

Fig. 17.13. Scatterplot of the black cherry tree data.

Compute all three estimates for the data in Table 17.3. You need at least
5 digits accuracy, and you may use that $\sum x_i = 87.456$, $\sum y_i = 26.486$,
$\sum y_i/x_i = 9.369$, $\sum x_i y_i = 95.498$, and $\sum x_i^2 = 314.644$.

17.10 Let $X$ be a random variable with (continuous) distribution function $F$.
Let $m = q_{0.5} = F^{\mathrm{inv}}(0.5)$ be the median of $F$ and define the random variable

\[ Y = |X - m|. \]

a. Show that $Y$ has distribution function $G$, defined by

\[ G(y) = F(m + y) - F(m - y). \]

b. The MAD of $F$ is the median of $G$. Show that if the density $f$ corresponding
to $F$ is symmetric around its median $m$, then

\[ G(y) = 2F(m + y) - 1 \]

and derive that

\[ G^{\mathrm{inv}}(\tfrac12) = F^{\mathrm{inv}}(\tfrac34) - F^{\mathrm{inv}}(\tfrac12). \]

c. Use b to conclude that the MAD of an $N(\mu, \sigma^2)$ distribution is equal to
$\sigma\,\Phi^{\mathrm{inv}}(3/4)$, where $\Phi$ is the distribution function of a standard normal
distribution. Recall that the distribution function $F$ of an $N(\mu, \sigma^2)$ can
be written as

\[ F(x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right). \]

You might check that, as stated in Section 17.2, the MAD of the $N(5, 4)$
distribution is equal to $2\Phi^{\mathrm{inv}}(3/4) = 1.3490$.

17.11 In this exercise we compute the MAD of the $\mathrm{Exp}(\lambda)$ distribution.

a. Let $X$ have an $\mathrm{Exp}(\lambda)$ distribution, with median $m = (\ln 2)/\lambda$. Show that
$Y = |X - m|$ has distribution function

\[ G(y) = \tfrac12\left(e^{\lambda y} - e^{-\lambda y}\right). \]

b. Argue that the MAD of the $\mathrm{Exp}(\lambda)$ distribution is a solution of the
equation $e^{2\lambda y} - e^{\lambda y} - 1 = 0$.

c. Compute the MAD of the $\mathrm{Exp}(\lambda)$ distribution.

Hint: put $x = e^{\lambda y}$ and first solve for $x$.

In the forthcoming chapters we will develop statistical methods to infer
knowledge about the model distribution and encounter several sample statistics to
do this. In the previous chapter we have seen examples of sample statistics
that can be used to estimate different model features, for instance, the
empirical distribution function to estimate the model distribution function $F$,
and the sample mean to estimate the expectation $\mu$ corresponding to $F$. One
of the things we would like to know is how close a sample statistic is to the
model feature it is supposed to estimate. For instance, what is the probability
that the sample mean and $\mu$ differ more than a given tolerance $\varepsilon$? For this
we need to know the distribution of $\bar{X}_n - \mu$. More generally, it is important
to know how a sample statistic is distributed in relation to the corresponding
model feature. For the distribution of the sample mean we saw a normal limit
approximation in Chapter 14. In this chapter we discuss a simulation
procedure that approximates the distribution of the sample mean for finite sample
size. Moreover, the method is more generally applicable to sample statistics
other than the sample mean.


18.1  The  bootstrap  principle 

Consider the Old Faithful data introduced in Chapter 15, which we modeled
as the realization of a random sample of size $n = 272$ from some distribution
function $F$. The sample mean $\bar{x}_n$ of the observed durations equals 209.3. What
does this say about the expectation $\mu$ of $F$? As we saw in Chapter 17, the value
209.3 is a natural estimate for $\mu$, but to conclude that $\mu$ is equal to 209.3 is
unwise. The reason is that, if we were to observe a new dataset of durations, we
would obtain a different sample mean as an estimate for $\mu$. This should not come
as a surprise. Since the dataset $x_1, x_2, \ldots, x_n$ is just one possible realization
of the random sample $X_1, X_2, \ldots, X_n$, the observed sample mean is just one
possible realization of the random variable

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}. \]

A new dataset is another realization of the random sample, and the
corresponding sample mean is another realization of the random variable $\bar{X}_n$.
Hence, to infer something about $\mu$, one should take into account how
realizations of $\bar{X}_n$ vary. This variation is described by the probability distribution
of $\bar{X}_n$.

In principle¹ it is possible to determine the distribution function of $\bar{X}_n$ from
the distribution function $F$ of the random sample $X_1, X_2, \ldots, X_n$. However,
$F$ is unknown. Nevertheless, in Chapter 17 we saw that the observed dataset
reflects most features of the “true” probability distribution. Hence the natural
thing to do is to compute an estimate $\hat{F}$ for the distribution function $F$ and
then to consider a random sample from $\hat{F}$ and the corresponding sample mean
as substitutes for the random sample $X_1, X_2, \ldots, X_n$ from $F$ and the random
variable $\bar{X}_n$. A random sample from $\hat{F}$ is called a bootstrap random sample,
or briefly bootstrap sample, and is denoted by

\[ X_1^*, X_2^*, \ldots, X_n^* \]

to distinguish it from the random sample $X_1, X_2, \ldots, X_n$ from the “true” $F$.
The corresponding average is called the bootstrapped sample mean, and this
random variable is denoted by

\[ \bar{X}_n^* = \frac{X_1^* + X_2^* + \cdots + X_n^*}{n} \]

to distinguish it from the random variable $\bar{X}_n$. The idea is now to use the
distribution of $\bar{X}_n^*$ to approximate the distribution of $\bar{X}_n$.

The preceding procedure is called the bootstrap principle for the sample mean.
Clearly, it can be applied to any sample statistic $h(X_1, X_2, \ldots, X_n)$ by
approximating its probability distribution by that of the corresponding bootstrapped
sample statistic $h(X_1^*, X_2^*, \ldots, X_n^*)$.


Bootstrap principle. Use the dataset $x_1, x_2, \ldots, x_n$ to compute
an estimate $\hat{F}$ for the “true” distribution function $F$. Replace
the random sample $X_1, X_2, \ldots, X_n$ from $F$ by a random sample
$X_1^*, X_2^*, \ldots, X_n^*$ from $\hat{F}$, and approximate the probability
distribution of $h(X_1, X_2, \ldots, X_n)$ by that of $h(X_1^*, X_2^*, \ldots, X_n^*)$.
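
As a rough illustration of the principle in the box, the following Python sketch separates its two ingredients: a way to draw from the estimate $\hat{F}$ and a sample statistic $h$. The names `bootstrap_replicates` and `draw_from_f_hat`, the dataset, and the choice of the sample median as $h$ are illustrative assumptions, not part of the text. Taking $\hat{F}$ to be the empirical distribution of the data is only one possible choice of estimate.

```python
import random
import statistics


def bootstrap_replicates(draw_from_f_hat, h, n, repetitions, rng):
    """Approximate the distribution of h(X_1, ..., X_n) by simulation.

    draw_from_f_hat(rng) must return one value drawn from the estimate
    F-hat; h maps a list of n values to a number.
    """
    replicates = []
    for _ in range(repetitions):
        # One bootstrap sample X_1*, ..., X_n* from F-hat.
        bootstrap_sample = [draw_from_f_hat(rng) for _ in range(n)]
        # The corresponding value of the bootstrapped sample statistic.
        replicates.append(h(bootstrap_sample))
    return replicates


# Hypothetical dataset, used only to make the sketch runnable.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
rng = random.Random(1)

# Assumption: F-hat is the empirical distribution of `data`, so one
# draw from F-hat is one uniformly chosen data point.
medians = bootstrap_replicates(
    draw_from_f_hat=lambda r: r.choice(data),
    h=statistics.median,
    n=len(data),
    repetitions=500,
    rng=rng,
)
```

The list `medians` then plays the role of realizations of $h(X_1^*, \ldots, X_n^*)$; swapping in another sampler or another statistic leaves the rest of the sketch unchanged.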


¹ In Section 11.1 we saw how the distribution of the sum of independent random
variables can be computed. Together with the change-of-units rule (see page 106),
the distribution of $\bar{X}_n$ can be determined. See also Section 13.1, where this is done
for independent $\mathrm{Gam}(2, 1)$ variables.

Returning to the sample mean, the first question that comes to mind is, of
course, how well does the distribution of $\bar{X}_n^*$ approximate the distribution
of $\bar{X}_n$? Or more generally, how well does the distribution of a bootstrapped
sample statistic $h(X_1^*, X_2^*, \ldots, X_n^*)$ approximate the distribution of the
sample statistic of interest $h(X_1, X_2, \ldots, X_n)$? Applied in such a straightforward
manner, the bootstrap approximation for the distribution of $\bar{X}_n$ by that of
$\bar{X}_n^*$ may not be so good (see Remark 18.1). The bootstrap approximation will
improve if we approximate the distribution of the centered sample mean:

\[ \bar{X}_n - \mu, \]

where $\mu$ is the expectation corresponding to $F$. The bootstrapped version
would be the random variable

\[ \bar{X}_n^* - \mu^*, \]

where $\mu^*$ is the expectation corresponding to $\hat{F}$. Often the bootstrap
approximation of the distribution of a sample statistic will improve if we somehow
normalize the sample statistic by relating it to a corresponding feature of the
“true” distribution. An example is the centered sample median

\[ \mathrm{Med}(X_1, X_2, \ldots, X_n) - F^{\mathrm{inv}}(0.5), \]

where we subtract the median $F^{\mathrm{inv}}(0.5)$ of $F$. Another example is the
normalized sample variance

\[ \frac{S_n^2}{\sigma^2}, \]

where we divide by the variance $\sigma^2$ of $F$.

Quick exercise 18.1 Describe how the bootstrap principle should be applied
to approximate the distribution of $\mathrm{Med}(X_1, X_2, \ldots, X_n) - F^{\mathrm{inv}}(0.5)$.


Remark 18.1 (The bootstrap for the sample mean). To see why
the bootstrap approximation for $\bar{X}_n$ may be bad, consider a dataset
$x_1, x_2, \ldots, x_n$ that is a realization of a random sample $X_1, X_2, \ldots, X_n$ from
an $N(\mu, 1)$ distribution. In that case the corresponding sample mean $\bar{X}_n$
has an $N(\mu, 1/n)$ distribution. We estimate $\mu$ by $\bar{x}_n$ and replace the
random sample from an $N(\mu, 1)$ distribution by a bootstrap random sample
$X_1^*, X_2^*, \ldots, X_n^*$ from an $N(\bar{x}_n, 1)$ distribution. The corresponding
bootstrapped sample mean $\bar{X}_n^*$ has an $N(\bar{x}_n, 1/n)$ distribution. Therefore the
distribution functions $G_n$ and $G_n^*$ of the random variables $\bar{X}_n$ and $\bar{X}_n^*$ can
be determined:

\[ G_n(a) = \Phi\big(\sqrt{n}\,(a - \mu)\big) \quad \text{and} \quad G_n^*(a) = \Phi\big(\sqrt{n}\,(a - \bar{x}_n)\big). \]

In this case it turns out that the maximum distance between the two
distribution functions is equal to

\[ 2\Phi\big(\tfrac12 \sqrt{n}\,|\bar{x}_n - \mu|\big) - 1. \]

Since $\bar{X}_n$ has an $N(\mu, 1/n)$ distribution, this value is approximately equal to
$2\Phi(|z|/2) - 1$, where $z$ is a realization of an $N(0, 1)$ random variable $Z$. This
only equals zero for $z = 0$, so that the distance between the distribution
functions of $\bar{X}_n$ and $\bar{X}_n^*$ will almost always be strictly positive, even for
large $n$.
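
The maximum-distance formula in Remark 18.1 can be checked numerically. The sketch below, written under the remark's normal model, compares the largest gap between the two distribution functions on a fine grid with the closed form $2\Phi(\tfrac12\sqrt{n}\,|\bar{x}_n - \mu|) - 1$. The function names and the values $\mu = 0$, $\bar{x}_n = 0.1$, $n = 100$ are hypothetical choices made for illustration.

```python
import math


def phi(x):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_distance_on_grid(mu, xbar, n, grid_points=20001):
    """Largest |G_n(a) - G_n^*(a)| over a fine grid of a-values."""
    root_n = math.sqrt(n)
    low = min(mu, xbar) - 6.0 / root_n
    high = max(mu, xbar) + 6.0 / root_n
    step = (high - low) / (grid_points - 1)
    best = 0.0
    for k in range(grid_points):
        a = low + k * step
        gap = abs(phi(root_n * (a - mu)) - phi(root_n * (a - xbar)))
        best = max(best, gap)
    return best


def max_distance_formula(mu, xbar, n):
    """Closed form from the remark: 2*Phi(sqrt(n)*|xbar - mu|/2) - 1."""
    return 2.0 * phi(0.5 * math.sqrt(n) * abs(xbar - mu)) - 1.0


# Hypothetical values: true mean 0, observed sample mean 0.1, n = 100.
grid_value = max_distance_on_grid(0.0, 0.1, 100)
formula_value = max_distance_formula(0.0, 0.1, 100)
```

The largest gap occurs halfway between $\mu$ and $\bar{x}_n$, which is why the closed form involves half the scaled distance $\sqrt{n}\,|\bar{x}_n - \mu|$.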

The question that remains is what to take as an estimate $\hat{F}$ for $F$. This
will depend on how well $F$ can be specified. For the Old Faithful data we
cannot say anything about the type of distribution. However, for the software
data it seems reasonable to model the dataset as a realization of a random
sample from an $\mathrm{Exp}(\lambda)$ distribution and then we only have to estimate the
parameter $\lambda$. Different assumptions about $F$ give rise to different bootstrap
procedures. We will discuss two of them in the next sections.


18.2  The  empirical  bootstrap 

Suppose we consider our dataset $x_1, x_2, \ldots, x_n$ as a realization of a random
sample from a distribution function $F$. When we cannot make any assumptions
about the type of $F$, we can always estimate $F$ by the empirical distribution
function of the dataset:

\[ \hat{F}(a) = F_n(a) = \frac{\text{number of } x_i \text{ less than or equal to } a}{n}. \]

Since we estimate $F$ by the empirical distribution function, the corresponding
bootstrap principle is called the empirical bootstrap. Applying this principle
to the centered sample mean, the random sample $X_1, X_2, \ldots, X_n$ from $F$ is
replaced by a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$ from $F_n$, and the
distribution of $\bar{X}_n - \mu$ is approximated by that of $\bar{X}_n^* - \mu^*$, where $\mu^*$ denotes
the expectation corresponding to $F_n$. The question is, of course, how good
this approximation is. A mathematical theorem tells us that the empirical
bootstrap works for the centered sample mean, i.e., the distribution of $\bar{X}_n - \mu$
is well approximated by that of $\bar{X}_n^* - \mu^*$ (see Remark 18.2). On the other hand,
there are (normalized) sample statistics for which the empirical bootstrap fails,
such as

\[ 1 - \frac{\text{maximum of } X_1, X_2, \ldots, X_n}{\theta}, \]

based on a random sample $X_1, X_2, \ldots, X_n$ from a $U(0, \theta)$ distribution (see
Exercise 18.12).

Remark 18.2 (The empirical bootstrap for $\bar{X}_n - \mu$). For the centered
sample mean the bootstrap approximation works, even if we estimate $F$
by the empirical distribution function $F_n$. If $G_n$ denotes the distribution
function of $\bar{X}_n - \mu$ and $G_n^*$ the distribution function of its bootstrapped
version $\bar{X}_n^* - \mu^*$, then the maximum distance between $G_n^*$ and $G_n$ goes to
zero with probability one:

\[ \mathrm{P}\Big( \lim_{n \to \infty} \sup_{t \in \mathbb{R}} |G_n^*(t) - G_n(t)| = 0 \Big) = 1 \]

(see, for instance, Singh [32]). In fact, the empirical bootstrap approximation
can be improved by approximating the distribution of the standardized
average $\sqrt{n}\,(\bar{X}_n - \mu)/\sigma$ by its bootstrapped version $\sqrt{n}\,(\bar{X}_n^* - \mu^*)/\sigma^*$, where
$\sigma$ and $\sigma^*$ denote the standard deviations of $F$ and $F_n$. This approximation
is even better than the normal approximation by the central limit theorem!
See, for instance, Hall [14].

Let us continue with approximating the distribution of $\bar{X}_n - \mu$ by that of
$\bar{X}_n^* - \mu^*$. First note that the empirical distribution function $F_n$ of the original
dataset is the distribution function of a discrete random variable that attains
the values $x_1, x_2, \ldots, x_n$, each with probability $1/n$. This means that each of
the bootstrap random variables $X_i^*$ has expectation

\[ \mu^* = \mathrm{E}[X_i^*] = x_1 \cdot \frac{1}{n} + x_2 \cdot \frac{1}{n} + \cdots + x_n \cdot \frac{1}{n} = \bar{x}_n. \]

Therefore, applying the empirical bootstrap to $\bar{X}_n - \mu$ means approximating
its distribution by that of $\bar{X}_n^* - \bar{x}_n$. In principle it would be possible to
determine the probability distribution of $\bar{X}_n^* - \bar{x}_n$. Indeed, the random variable $\bar{X}_n^*$
is based on the random variables $X_i^*$, whose distribution we know precisely:
it takes values $x_1, x_2, \ldots, x_n$ with equal probability $1/n$. Hence we could
determine the possible values of $\bar{X}_n^* - \bar{x}_n$ and the corresponding probabilities.
For small $n$ this can be done (see Exercise 18.5), but for large $n$ this becomes
cumbersome. Therefore we invoke a second approximation.
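
To make concrete why the exact computation is feasible only for small $n$, here is a minimal Python sketch that enumerates all $n^n$ equally likely bootstrap datasets and collects the exact distribution of the centered bootstrap sample mean. The function name and the three-point dataset are hypothetical and serve only as an illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def exact_centered_bootstrap_mean(dataset):
    """Exact distribution of the centered bootstrap sample mean.

    Enumerates all n**n equally likely bootstrap datasets, so this is
    only feasible for very small n.
    """
    n = len(dataset)
    values = [Fraction(x) for x in dataset]
    xbar = sum(values) / n
    weight = Fraction(1, n ** n)  # probability of each bootstrap dataset
    distribution = Counter()
    for bootstrap_dataset in product(values, repeat=n):
        centered_mean = sum(bootstrap_dataset) / n - xbar
        distribution[centered_mean] += weight
    return dict(distribution)


# Hypothetical dataset of size 3 (not taken from the text).
dist = exact_centered_bootstrap_mean([0, 3, 6])
```

For $n = 3$ there are only 27 bootstrap datasets, but for $n = 272$ the count $272^{272}$ is far beyond what can be enumerated, which motivates the simulation approach.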

Recall the jury example in Section 6.3, where we investigated the variation
of two different rules that a jury might use to assign grades. In terms of
the present chapter, the jury example deals with a random sample from a
$U(-0.5, 0.5)$ distribution and two different sample statistics $T$ and $M$,
corresponding to the two rules. To investigate the distribution of $T$ and $M$,
a simulation was carried out with one thousand runs, where in every run we
generated a realization of a random sample from the $U(-0.5, 0.5)$ distribution
and computed the corresponding realization of $T$ and $M$. The one thousand
realizations give a good impression of how $T$ and $M$ vary around the deserved
score (see Figure 6.4).

Returning to the distribution of $\bar{X}_n^* - \bar{x}_n$, the analogue would be to repeatedly
generate a realization of the bootstrap random sample from $F_n$ and every time
compute the corresponding realization of $\bar{X}_n^* - \bar{x}_n$. The resulting realizations
would give a good impression about the distribution of $\bar{X}_n^* - \bar{x}_n$. A realization
of the bootstrap random sample is called a bootstrap dataset and is denoted
by

\[ x_1^*, x_2^*, \ldots, x_n^* \]

to distinguish it from the original dataset $x_1, x_2, \ldots, x_n$. For the centered
sample mean the simulation procedure is as follows.

Empirical bootstrap simulation (for $\bar{X}_n - \mu$). Given a dataset
$x_1, x_2, \ldots, x_n$, determine its empirical distribution function $F_n$ as an
estimate of $F$, and compute the expectation

\[ \mu^* = \bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

corresponding to $F_n$.

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$.

2. Compute the centered sample mean for the bootstrap dataset:

\[ \bar{x}_n^* - \bar{x}_n, \quad \text{where} \quad \bar{x}_n^* = \frac{x_1^* + x_2^* + \cdots + x_n^*}{n}. \]

Repeat steps 1 and 2 many times.


Note that generating a value $x_i^*$ from $F_n$ is equivalent to choosing one of the
elements $x_1, x_2, \ldots, x_n$ of the original dataset with equal probability $1/n$.

The empirical bootstrap simulation is described for the centered sample mean,
but clearly a similar simulation procedure can be formulated for any
(normalized) sample statistic.
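
The two steps of the empirical bootstrap simulation can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name, the seed, and the ten data values are hypothetical, and drawing from $F_n$ is done by resampling the dataset with replacement, as noted above.

```python
import random


def empirical_bootstrap_centered_means(dataset, repetitions, seed=None):
    """Return `repetitions` realizations of the centered bootstrap mean."""
    rng = random.Random(seed)
    n = len(dataset)
    xbar = sum(dataset) / n  # expectation mu* corresponding to F_n
    centered_means = []
    for _ in range(repetitions):
        # Step 1: draw n values from F_n, i.e. resample with replacement.
        bootstrap_dataset = rng.choices(dataset, k=n)
        # Step 2: centered sample mean of the bootstrap dataset.
        centered_means.append(sum(bootstrap_dataset) / n - xbar)
    return centered_means


# Hypothetical data values, standing in for a real dataset.
values = [210, 115, 245, 130, 260, 180, 275, 225, 120, 255]
centered = empirical_bootstrap_centered_means(values, 1000, seed=18)
```

For another sample statistic, only the line that computes the centered mean would change, for example to the sample median of the bootstrap dataset minus the sample median of the original dataset.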

Remark  18.3  (Some  history).  Although  Efron  [7]  in  1979  drew  attention 
to  diverse  applications  of  the  empirical  bootstrap  simulation,  it  already 
existed  before  that  time,  but  not  as  a  unified  widely  applicable  technique. 

See  Hall  [14]  for  references  to  earlier  ideas  along  similar  lines  and  to  further 
development  of  the  bootstrap.  One  of  Efron’s  contributions  was  to  point  out 
how  to  combine  the  bootstrap  with  modern  computational  power.  In  this 
way,  the  interest  in  this  procedure  is  a  typical  consequence  of  the  influence  of 
computers  on  the  development  of  statistics  in  the  past  decades.  Efron  also 
coined the term “bootstrap,” which is inspired by the American version
of one of the tall stories of the Baron von Münchhausen, who claimed to
have lifted himself out of a swamp by pulling the strap on his boot (in the
European version he lifted himself by pulling his hair).

Quick exercise 18.2 Describe the empirical bootstrap simulation for the
centered sample median $\mathrm{Med}(X_1, X_2, \ldots, X_n) - F^{\mathrm{inv}}(0.5)$.

For the Old Faithful data we carried out the empirical bootstrap simulation
for the centered sample mean with one thousand repetitions. In Figure 18.1
a histogram (left) and kernel density estimate (right) are displayed of one
thousand centered bootstrap sample means

\[ \bar{x}_{n,1}^* - \bar{x}_n, \quad \bar{x}_{n,2}^* - \bar{x}_n, \quad \ldots, \quad \bar{x}_{n,1000}^* - \bar{x}_n. \]

Fig. 18.1. Histogram and kernel density estimate of centered bootstrap sample
means.


Since these are realizations of the random variable $\bar{X}_n^* - \bar{x}_n$, we know from
Section 17.2 that they reflect the distribution of $\bar{X}_n^* - \bar{x}_n$. Hence, as the
distribution of $\bar{X}_n^* - \bar{x}_n$ approximates that of $\bar{X}_n - \mu$, the centered bootstrap
sample means also reflect the distribution of $\bar{X}_n - \mu$. This leads to the following
application.

An  application  of  the  empirical  bootstrap 

Let us return to our example about the Old Faithful data, which are modeled
as a realization of a random sample from some $F$. Suppose we estimate
the expectation $\mu$ corresponding to $F$ by $\bar{x}_n = 209.3$. Can we say how far
away 209.3 is from the “true” expectation $\mu$? To be honest, the answer is
no... (oops). In a situation like this, the measurements and their corresponding
average are subject to randomness, so that we cannot say anything with
absolute certainty about how far away the average will be from $\mu$. One of the
things we can say is how likely it is that the average is within a given distance
from $\mu$.

To get an impression of how close the average of a dataset of $n = 272$ observed
durations of the Old Faithful geyser is to $\mu$, we want to compute the
probability that the sample mean deviates more than 5 from $\mu$:

\[ \mathrm{P}(|\bar{X}_n - \mu| > 5). \]

Direct computation of this probability is impossible, because we do not know
the distribution of the random variable $\bar{X}_n - \mu$. However, since the distribution
of $\bar{X}_n^* - \bar{x}_n$ approximates the distribution of $\bar{X}_n - \mu$, we can approximate the
probability as follows

\[ \mathrm{P}(|\bar{X}_n - \mu| > 5) \approx \mathrm{P}(|\bar{X}_n^* - \bar{x}_n| > 5) = \mathrm{P}(|\bar{X}_n^* - 209.3| > 5), \]
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where  we  have  also  used  that  for  the  Old  Faithful  data,  =  209.3.  As  we 
mentioned  before,  in  principle  it  is  possible  to  compute  the  last  probability 
exactly.  Since  this  is  too  cumbersome,  we  approximate  P(|A*  —  209. 3|  >  5) 
by  means  of  the  one  thousand  centered  bootstrap  sample  means  obtained  from 
the  empirical  bootstrap  simulation: 

<1-  209.3  <2  -  209.3  •••  <iooo  -  209.3. 

In  view  of  Table  17.2,  a  natural  estimate  for  P(|A*  —  209. 3|  >  5)  is  the  relative 
frequency  of  centered  bootstrap  sample  means  that  are  greater  than  5  in 
absolute  value: 

number  of  i  with  ji*  ^  —  209. 3|  greater  than  5 
1000  ■ 

For  the  centered  bootstrap  sample  means  of  Figure  18.1,  this  relative  fre¬ 
quency  is  0.227.  Hence,  we  obtain  the  following  bootstrap  approximation 

P(K  -  ^1  >  5)  «  P(K  -  209.31  >  5)  «  0.227. 

It should be emphasized that the second approximation can be made arbitrarily
accurate by increasing the number of repetitions in the bootstrap
procedure.
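
The relative-frequency estimate described above can be sketched in a few lines
of Python. This is only an illustration: the function name
`bootstrap_tail_probability` is hypothetical, and the dataset below is a
synthetic stand-in, not the Old Faithful measurements, so the printed value
will not reproduce 0.227.

```python
import random
from statistics import fmean

def bootstrap_tail_probability(data, threshold, repetitions=1000, seed=0):
    """Empirical bootstrap estimate of P(|mean* - mean(data)| > threshold)."""
    rng = random.Random(seed)
    n = len(data)
    xbar = fmean(data)
    exceed = 0
    for _ in range(repetitions):
        # Draw n values with replacement from the dataset, i.e. from F_n.
        resample = rng.choices(data, k=n)
        # Count centered bootstrap means larger than the threshold in absolute value.
        if abs(fmean(resample) - xbar) > threshold:
            exceed += 1
    return exceed / repetitions

# Synthetic stand-in data (an assumption for illustration only).
data_rng = random.Random(1)
data = [data_rng.gauss(209.3, 68.0) for _ in range(272)]
estimate = bootstrap_tail_probability(data, threshold=5.0)
print(estimate)
```

Increasing `repetitions` makes the relative frequency a more accurate
approximation of the bootstrap probability, at the cost of more computation.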


18.3  The  parametric  bootstrap 

Suppose we consider our dataset as a realization of a random sample from a
distribution of a specific parametric type. In that case the distribution function
is completely determined by a parameter or vector of parameters $\theta$: $F = F_\theta$.
Then we do not have to estimate the whole distribution function $F$, but it
suffices to estimate the parameter (vector) $\theta$ by $\hat{\theta}$ and estimate $F$ by

\[ \hat{F} = F_{\hat{\theta}}. \]

The  corresponding  bootstrap  principle  is  called  the  parametric  bootstrap. 

Let us investigate what this would mean for the centered sample mean. First
we should realize that the expectation of $F_\theta$ is also determined by $\theta$: $\mu = \mu_\theta$.
The parametric bootstrap for the centered sample mean now amounts to the
following. The random sample $X_1, X_2, \ldots, X_n$ from the “true” distribution
function $F_\theta$ is replaced by a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$ from
$F_{\hat{\theta}}$, and the probability distribution of $\bar{X}_n - \mu_\theta$ is approximated by that of
$\bar{X}_n^* - \mu^*$, where

\[ \mu^* = \mu_{\hat{\theta}} \]

denotes the expectation corresponding to $F_{\hat{\theta}}$.

Often the parametric bootstrap approximation is better than the empirical
bootstrap approximation, as illustrated in the next quick exercise.

Quick exercise 18.3 Suppose the dataset $x_1, x_2, \ldots, x_n$ is a realization of a
random sample $X_1, X_2, \ldots, X_n$ from an $N(\mu, 1)$ distribution. Estimate $\mu$ by
$\bar{x}_n$ and consider a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$ from an $N(\bar{x}_n, 1)$
distribution. Check that the probability distributions of $\bar{X}_n - \mu$ and $\bar{X}_n^* - \bar{x}_n$
are the same: an $N(0, 1/n)$ distribution.

Once more, in principle it is possible to determine the distribution of $\bar{X}_n^* - \mu_{\hat{\theta}}$
exactly. However, in contrast with the situation considered in the previous
quick exercise, in some cases this is still cumbersome. Again a simulation
procedure may help us out. For the centered sample mean the procedure is as
follows.


Parametric bootstrap simulation (for $\bar{X}_n - \mu$). Given a
dataset $x_1, x_2, \ldots, x_n$, compute an estimate $\hat{\theta}$ for $\theta$. Determine $F_{\hat{\theta}}$
as an estimate for $F_\theta$, and compute the expectation $\mu^* = \mu_{\hat{\theta}}$
corresponding to $F_{\hat{\theta}}$.

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_{\hat{\theta}}$.

2. Compute the centered sample mean for the bootstrap dataset:

\[ \bar{x}_n^* - \mu_{\hat{\theta}}, \]

where

\[ \bar{x}_n^* = \frac{x_1^* + x_2^* + \cdots + x_n^*}{n}. \]

Repeat steps 1 and 2 many times.
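
As an illustration of the procedure above, here is a minimal Python sketch
for one particular parametric family. It assumes an $Exp(\lambda)$ model with
$\hat{\lambda} = 1/\bar{x}_n$, so that $\mu_{\hat{\lambda}} = 1/\hat{\lambda}$; the function name and the small
dataset are hypothetical and serve only to make the steps concrete.

```python
import random
from statistics import fmean

def parametric_bootstrap_centered_means(data, repetitions=1000, seed=0):
    """Parametric bootstrap for the centered sample mean under an Exp model."""
    rng = random.Random(seed)
    n = len(data)
    lam_hat = 1.0 / fmean(data)   # estimate of the model parameter
    mu_star = 1.0 / lam_hat       # expectation of the fitted Exp(lam_hat)
    centered = []
    for _ in range(repetitions):
        # Step 1: generate a bootstrap dataset from the fitted distribution.
        boot = [rng.expovariate(lam_hat) for _ in range(n)]
        # Step 2: compute the centered sample mean of the bootstrap dataset.
        centered.append(fmean(boot) - mu_star)
    return centered

# Hypothetical dataset, for illustration only.
example_data = [2.0, 4.0, 6.0, 8.0] * 10
centered_means = parametric_bootstrap_centered_means(example_data)
print(min(centered_means), max(centered_means))
```

For another parametric family only the estimate, the sampling call, and the
formula for the expectation would change; the two numbered steps stay the same.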


As an application we will use the parametric bootstrap simulation to investigate
whether the exponential distribution is a reasonable model for the software
data.

Are  the  software  data  exponential? 

Consider fitting an exponential distribution to the software data, as discussed
in Section 17.3. At first sight, Figure 17.6 shows a reasonable fit with the
exponential distribution. One way to quantify the difference between the dataset
and the exponential model is to compute the maximum distance between the
empirical distribution function $F_n$ of the dataset and the exponential
distribution function $F_{\hat{\lambda}}$ estimated from the dataset:

\[ t_{ks} = \sup_{a \in \mathbb{R}} |F_n(a) - F_{\hat{\lambda}}(a)|. \]

Here $F_{\hat{\lambda}}(a) = 0$ for $a < 0$ and

\[ F_{\hat{\lambda}}(a) = 1 - e^{-\hat{\lambda} a} \quad \text{for } a \ge 0, \]

where $\hat{\lambda} = 1/\bar{x}_n$ is estimated from the dataset. The quantity $t_{ks}$ is called the
Kolmogorov-Smirnov distance between $F_n$ and $F_{\hat{\lambda}}$.

The idea behind the use of this distance is the following. If $F$ denotes the
“true” distribution function, then according to Section 17.2 the empirical
distribution function $F_n$ will resemble $F$ whether $F$ equals the distribution
function $F_\lambda$ of some $Exp(\lambda)$ distribution or not. On the other hand, if the
“true” distribution function is $F_\lambda$, then the estimated exponential distribution
function $F_{\hat{\lambda}}$ will resemble $F_\lambda$, because $\hat{\lambda} = 1/\bar{x}_n$ is close to the “true” $\lambda$.
Therefore, if $F = F_\lambda$, then both $F_n$ and $F_{\hat{\lambda}}$ will be close to the same distribution
function, so that $t_{ks}$ is small; if $F$ is different from $F_\lambda$, then $F_n$ and $F_{\hat{\lambda}}$
are close to two different distribution functions, so that $t_{ks}$ is large. The value
$t_{ks}$ is always between 0 and 1, and the further away this value is from 0, the
more it is an indication that the exponential model is inappropriate. For the
software dataset we find $\hat{\lambda} = 1/\bar{x}_n = 0.0015$ and $t_{ks} = 0.176$. Does this speak
against the believed exponential model?
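
The Kolmogorov-Smirnov distance defined above can be computed exactly from
the ordered data. The sketch below relies on two facts: $F_n$ is a step function
and $F_{\hat{\lambda}}$ is continuous and increasing, so the supremum is reached at, or just
before, one of the data points. The function name is hypothetical and the
dataset is assumed to consist of positive numbers.

```python
import math
from statistics import fmean

def ks_distance_exponential(data):
    """KS distance between the empirical cdf and the fitted Exp(1/mean) cdf."""
    n = len(data)
    lam_hat = 1.0 / fmean(data)
    distance = 0.0
    for i, x in enumerate(sorted(data), start=1):
        model = 1.0 - math.exp(-lam_hat * x)
        # At x the empirical cdf jumps from (i - 1)/n to i/n; check both sides.
        distance = max(distance, i / n - model, model - (i - 1) / n)
    return distance

# Tiny hypothetical dataset: mean 2, so the fitted parameter is 0.5.
t_ks = ks_distance_exponential([1.0, 2.0, 3.0])
print(t_ks)
```

For this three-point dataset the largest gap occurs just before the smallest
observation, where the empirical distribution function is still 0 and the
fitted one equals $1 - e^{-0.5}$.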

One way to investigate this is to find out whether, in the case when the data are
truly a realization of an exponential random sample from $F_\lambda$, the value 0.176 is
unusually large. To answer this question we consider the sample statistic that
corresponds to $t_{ks}$. The estimate $\hat{\lambda} = 1/\bar{x}_n$ is replaced by the random variable
$\hat{\Lambda} = 1/\bar{X}_n$, and the empirical distribution function of the dataset is replaced
by the empirical distribution function of the random sample $X_1, X_2, \ldots, X_n$
(again denoted by $F_n$):

\[ F_n(a) = \frac{\text{number of } X_i \text{ less than or equal to } a}{n}. \]

In this way, $t_{ks}$ is a realization of the sample statistic

\[ T_{ks} = \sup_{a \in \mathbb{R}} |F_n(a) - F_{\hat{\Lambda}}(a)|. \]

To find out whether 0.176 is an exceptionally large value for the random variable
$T_{ks}$, we must determine the probability distribution of $T_{ks}$. However, this
is impossible because the parameter $\lambda$ of the $Exp(\lambda)$ distribution is unknown.
We will approximate the distribution of $T_{ks}$ by a parametric bootstrap. We use
the dataset to estimate $\lambda$ by $\hat{\lambda} = 1/\bar{x}_n = 0.0015$ and replace the random sample
$X_1, X_2, \ldots, X_n$ from $F_\lambda$ by a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$
from $F_{\hat{\lambda}}$. Next we approximate the distribution of $T_{ks}$ by that of its
bootstrapped version

\[ T_{ks}^* = \sup_{a \in \mathbb{R}} |F_n^*(a) - F_{\hat{\Lambda}^*}(a)|, \]

where $F_n^*$ is the empirical distribution function of the bootstrap random sample:

\[ F_n^*(a) = \frac{\text{number of } X_i^* \text{ less than or equal to } a}{n}, \]

and $\hat{\Lambda}^* = 1/\bar{X}_n^*$, with $\bar{X}_n^*$ being the average of the bootstrap random sample.
The bootstrapped sample statistic $T_{ks}^*$ is too complicated to determine its
probability distribution, and hence we perform a parametric bootstrap
simulation:

1. We generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_{135}^*$ from an exponential
distribution with parameter $\hat{\lambda} = 0.0015$.

2. We compute the bootstrapped KS distance

\[ t_{ks}^* = \sup_{a \in \mathbb{R}} |F_n^*(a) - F_{\hat{\lambda}^*}(a)|, \]

where $F_n^*$ denotes the empirical distribution function of the bootstrap
dataset and $F_{\hat{\lambda}^*}$ denotes the estimated exponential distribution function,
where $\hat{\lambda}^* = 1/\bar{x}_n^*$ is computed from the bootstrap dataset.

We repeat steps 1 and 2 one thousand times, which results in one thousand
values of the bootstrapped KS distance. In Figure 18.2 we have displayed a
histogram and kernel density estimate of the one thousand bootstrapped KS
distances. It is clear that if the software data came from an exponential
distribution, the value 0.176 of the KS distance would be very unlikely! This
strongly suggests that the exponential distribution is not the right model for
the software data. The reason for this is that the Poisson process is the wrong
model for the series of failures. A closer inspection shows that the rate at
which failures occur over time is not constant, as was assumed in Chapter 17,
but decreases.
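
The two simulation steps above translate directly into code. The following
Python sketch is an illustration under stated assumptions: the software data
are not reproduced here, so a synthetic dataset of 135 values drawn from a
Weibull distribution (an arbitrary non-exponential choice) stands in for them,
and all function names are hypothetical.

```python
import math
import random
from statistics import fmean

def ks_distance_exponential(data):
    """KS distance between the empirical cdf and the fitted Exp(1/mean) cdf."""
    n = len(data)
    lam_hat = 1.0 / fmean(data)
    distance = 0.0
    for i, x in enumerate(sorted(data), start=1):
        model = 1.0 - math.exp(-lam_hat * x)
        distance = max(distance, i / n - model, model - (i - 1) / n)
    return distance

def bootstrap_ks_distances(data, repetitions=1000, seed=0):
    """Parametric bootstrap values of the KS distance under the Exp model."""
    rng = random.Random(seed)
    n = len(data)
    lam_hat = 1.0 / fmean(data)
    values = []
    for _ in range(repetitions):
        # Step 1: bootstrap dataset from the fitted exponential distribution.
        boot = [rng.expovariate(lam_hat) for _ in range(n)]
        # Step 2: KS distance with the parameter re-estimated from boot.
        values.append(ks_distance_exponential(boot))
    return values

# Synthetic stand-in for the data (an assumption, not the software data).
data_rng = random.Random(42)
data = [data_rng.weibullvariate(500.0, 0.6) for _ in range(135)]
observed = ks_distance_exponential(data)
boot_values = bootstrap_ks_distances(data)
# Fraction of bootstrapped KS distances at least as large as the observed one.
fraction = sum(v >= observed for v in boot_values) / len(boot_values)
print(observed, fraction)
```

A small fraction indicates that the observed distance would be unusual under
the exponential model, which is the comparison Figure 18.2 makes visually.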


Fig.  18.2.  One  thousand  bootstrapped  KS  distances. 


18.4  Solutions  to  the  quick  exercises 

18.1 You could have written something like the following: “Use the dataset
$x_1, x_2, \ldots, x_n$ to compute an estimate $\hat{F}$ for $F$. Replace the random sample
$X_1, X_2, \ldots, X_n$ from $F$ by a random sample $X_1^*, X_2^*, \ldots, X_n^*$ from $\hat{F}$, and
approximate the probability distribution of

\[ \mathrm{Med}(X_1, X_2, \ldots, X_n) - F^{\mathrm{inv}}(0.5) \]

by that of $\mathrm{Med}(X_1^*, X_2^*, \ldots, X_n^*) - \hat{F}^{\mathrm{inv}}(0.5)$, where $\hat{F}^{\mathrm{inv}}(0.5)$ is the median
of $\hat{F}$.”

18.2 You could have written something like the following: “Given a dataset
$x_1, x_2, \ldots, x_n$, determine its empirical distribution function $F_n$ as an estimate
of $F$, and the median $F_n^{\mathrm{inv}}(0.5)$ of $F_n$.

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$.

2. Compute the centered sample median for the bootstrap dataset:

\[ \mathrm{Med}_n^* - F_n^{\mathrm{inv}}(0.5), \]

where $\mathrm{Med}_n^*$ = sample median of $x_1^*, x_2^*, \ldots, x_n^*$.

Repeat steps 1 and 2 many times.”

Note that if $n$ is odd, then $F_n^{\mathrm{inv}}(0.5)$ equals the sample median of the original
dataset, but this is not necessarily so for $n$ even.

18.3 According to Remark 11.2 about the sum of independent normal random
variables, the sum of $n$ independent $N(\mu, 1)$ distributed random variables
has an $N(n\mu, n)$ distribution. Hence by the change-of-units rule for the normal
distribution (see page 106), it follows that $\bar{X}_n$ has an $N(\mu, 1/n)$ distribution,
and that $\bar{X}_n - \mu$ has an $N(0, 1/n)$ distribution. Similarly, the average $\bar{X}_n^*$ of
$n$ independent $N(\bar{x}_n, 1)$ distributed bootstrap random variables has an
$N(\bar{x}_n, 1/n)$ distribution, and therefore $\bar{X}_n^* - \bar{x}_n$ again has an
$N(0, 1/n)$ distribution.
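
The conclusion of this solution can also be checked by simulation. The Python
sketch below is only an illustration: the values $\mu = 3$ and $n = 25$ are arbitrary
assumptions, and the helper name `centered_means` is hypothetical. It compares
the variances of simulated values of $\bar{X}_n - \mu$ and $\bar{X}_n^* - \bar{x}_n$ with $1/n$.

```python
import random
from statistics import fmean, pvariance

def centered_means(center, n, repetitions, rng):
    """Simulate centered means of samples of size n from N(center, 1)."""
    values = []
    for _ in range(repetitions):
        sample = [rng.gauss(center, 1.0) for _ in range(n)]
        values.append(fmean(sample) - center)
    return values

rng = random.Random(0)
n = 25
mu = 3.0                                    # assumed "true" expectation
dataset = [rng.gauss(mu, 1.0) for _ in range(n)]
xbar = fmean(dataset)                       # estimate of mu from one dataset

original = centered_means(mu, n, 4000, rng)     # values of X-bar minus mu
bootstrap = centered_means(xbar, n, 4000, rng)  # values of X-bar* minus x-bar
print(pvariance(original), pvariance(bootstrap), 1 / n)
```

Both simulated variances should be close to $1/n = 0.04$, in line with the
$N(0, 1/n)$ distribution derived above.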


18.5  Exercises 

18.1 □ We generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_6^*$ from the empirical
distribution function of the dataset

2  1  1  4  6  3, 

i.e.,  we  draw  (with  replacement)  six  values  from  these  numbers  with  equal 
probability  1/6.  How  many  different  bootstrap  datasets  are  possible?  Are 
they  all  equally  likely  to  occur? 

18.2 We generate a bootstrap dataset $x_1^*, x_2^*, x_3^*, x_4^*$ from the empirical
distribution function of the dataset


1  3  4  6. 


a. Compute the probability that the bootstrap sample mean is equal to 1.

b. Compute the probability that the maximum of the bootstrap dataset is
equal to 6.

c. Compute the probability that exactly two elements in the bootstrap sample
are less than 2.

18.3 ⊞ We generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_{10}^*$ from the empirical
distribution function of the dataset

0.39  0.41  0.38  0.44  0.40 
0.36  0.34  0.46  0.35  0.37. 

a. Compute the probability that the bootstrap dataset has exactly three
elements equal to 0.35.

b. Compute the probability that the bootstrap dataset has at most two elements
less than or equal to 0.38.

c. Compute the probability that the bootstrap dataset has exactly two elements
less than or equal to 0.38 and all other elements greater than 0.42.

18.4  □  Consider  the  dataset  from  Exercise  18.3,  with  maximum  0.46. 

a. We generate a bootstrap random sample $X_1^*, X_2^*, \ldots, X_{10}^*$ from the
empirical distribution function of the dataset. Compute $\mathrm{P}(M_{10}^* < 0.46)$, where
$M_{10}^* = \max\{X_1^*, X_2^*, \ldots, X_{10}^*\}$.

b. The same question as in a, but now for a dataset with distinct elements
$x_1, x_2, \ldots, x_n$ and maximum $m_n$. Compute $\mathrm{P}(M_n^* < m_n)$, where $M_n^*$ is
the maximum of a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$ generated
from the empirical distribution function of the dataset.

18.5  □  Suppose  we  have  a  dataset 

0  3  6, 

which is the realization of a random sample from a distribution function $F$. If
we estimate $F$ by the empirical distribution function, then according to the
bootstrap principle applied to the centered sample mean $\bar{X}_3 - \mu$, we must
replace this random variable by its bootstrapped version $\bar{X}_3^* - \bar{x}_3$. Determine
the possible values for the bootstrap random variable $\bar{X}_3^* - \bar{x}_3$ and the
corresponding probabilities.

18.6 Suppose that the dataset $x_1, x_2, \ldots, x_n$ is a realization of a random
sample from an $Exp(\lambda)$ distribution with distribution function $F_\lambda$, and that
$\bar{x}_n = 5$.

a. Check that the median of the $Exp(\lambda)$ distribution is $m_\lambda = (\ln 2)/\lambda$ (see
also Exercise 5.11).

b. Suppose we estimate $\lambda$ by $1/\bar{x}_n$. Describe the parametric bootstrap
simulation for $\mathrm{Med}(X_1, X_2, \ldots, X_n) - m_\lambda$.

18.7 ⊞ To give an example in which the bootstrapped centered sample mean
in the parametric and empirical bootstrap simulations may be different,
consider the following situation. Suppose that the dataset $x_1, x_2, \ldots, x_n$ is a
realization of a random sample from a $U(0, \theta)$ distribution with expectation
$\mu = \theta/2$. We estimate $\theta$ by

\[ \hat{\theta} = \frac{n+1}{n} m_n, \]

where $m_n = \max\{x_1, x_2, \ldots, x_n\}$. Describe the parametric bootstrap
simulation for the centered sample mean $\bar{X}_n - \mu$.

18.8 ⊞ Here is an example in which the bootstrapped centered sample mean in
the parametric and empirical bootstrap simulations are the same. Consider the
software data with average $\bar{x}_n = 656.8815$ and median $m_n = 290$, modeled as
a realization of a random sample $X_1, X_2, \ldots, X_n$ from a distribution function
$F$ with expectation $\mu$. By means of bootstrap simulation we would like to get an
impression of the distribution of $\bar{X}_n - \mu$.

a. Suppose that we assume nothing about the distribution of the interfailure
times. Describe the appropriate bootstrap simulation procedure with one
thousand repetitions.

b. Suppose we assume that $F$ is the distribution function of an $Exp(\lambda)$
distribution, where $\lambda$ is estimated by $1/\bar{x}_n = 0.0015$. Describe the appropriate
bootstrap simulation procedure with one thousand repetitions.

c. Suppose we assume that $F$ is the distribution function of an $Exp(\lambda)$
distribution, and that (as suggested by Exercise 18.6 a) the parameter $\lambda$
is estimated by $(\ln 2)/m_n = 0.0024$. Describe the appropriate bootstrap
simulation procedure with one thousand repetitions.

18.9 □ Consider the dataset from Exercises 15.1 and 17.6 consisting of measured
chest circumferences of Scottish soldiers with average $\bar{x}_n = 39.85$ and
sample standard deviation $s_n = 2.09$. The histogram in Figure 17.11 suggests
modeling the data as the realization of a random sample $X_1, X_2, \ldots, X_n$ from
an $N(\mu, \sigma^2)$ distribution. We estimate $\mu$ by the sample mean and we are
interested in the probability that the sample mean deviates more than 1 from $\mu$:
$\mathrm{P}(|\bar{X}_n - \mu| > 1)$. Describe how one can use the bootstrap principle to
approximate this probability, i.e., describe the distribution of the bootstrap random
sample $X_1^*, X_2^*, \ldots, X_n^*$ and compute $\mathrm{P}(|\bar{X}_n^* - \mu^*| > 1)$. Note that one does
not need a simulation to approximate this latter probability.

18.10 Consider the software data, with average $\bar{x}_n = 656.8815$, modeled as
a realization of a random sample $X_1, X_2, \ldots, X_n$ from a distribution
function $F$. We estimate the expectation $\mu$ of $F$ by the sample mean and we are
interested in the probability that the sample mean deviates more than ten
from $\mu$: $\mathrm{P}(|\bar{X}_n - \mu| > 10)$.

a. Suppose we assume nothing about the distribution of the interfailure
times. Describe how one can obtain a bootstrap approximation for the
probability, i.e., describe the appropriate bootstrap simulation procedure
with one thousand repetitions and how the results of this simulation can
be used to approximate the probability.

b. Suppose we assume that $F$ is the distribution function of an $Exp(\lambda)$
distribution. Describe how one can obtain a bootstrap approximation for the
probability.

18.11 Consider the dataset of measured chest circumferences of 5732 Scottish
soldiers (see Exercises 15.1, 17.6, and 18.9). The Kolmogorov-Smirnov distance
between the empirical distribution function and the distribution function
$F_{\bar{x}_n, s_n}$ of the normal distribution with estimated parameters $\hat{\mu} = \bar{x}_n = 39.85$
and $\hat{\sigma} = s_n = 2.09$ is equal to

\[ t_{ks} = \sup_{a \in \mathbb{R}} |F_n(a) - F_{\bar{x}_n, s_n}(a)| = 0.0987, \]

where $\bar{x}_n$ and $s_n$ denote sample mean and sample standard deviation of the
dataset. Suppose we want to perform a bootstrap simulation with one thousand
repetitions for the KS distance to investigate to which degree the value
0.0987 agrees with the assumed normality of the dataset. Describe the
appropriate bootstrap simulation that must be carried out.

18.12 To give an example where the empirical bootstrap fails, consider the
following situation. Suppose our dataset $x_1, x_2, \ldots, x_n$ is a realization of a
random sample $X_1, X_2, \ldots, X_n$ from a $U(0, \theta)$ distribution. Consider the
normalized sample statistic

\[ T_n = 1 - \frac{M_n}{\theta}, \]

where $M_n$ is the maximum of $X_1, X_2, \ldots, X_n$. Let $X_1^*, X_2^*, \ldots, X_n^*$ be a
bootstrap random sample from the empirical distribution function of our dataset,
and let $M_n^*$ be the corresponding bootstrap maximum. We are going to
compare the distribution functions of $T_n$ and its bootstrap counterpart

\[ T_n^* = 1 - \frac{M_n^*}{m_n}, \]

where $m_n$ is the maximum of $x_1, x_2, \ldots, x_n$.

a. Check that $\mathrm{P}(T_n \le 0) = 0$ and show that

\[ \mathrm{P}(T_n^* \le 0) = 1 - \left(1 - \frac{1}{n}\right)^n. \]

Hint: first argue that $\mathrm{P}(T_n^* \le 0) = \mathrm{P}(M_n^* = m_n)$, and then use the result
of Exercise 18.4.

b. Let $G_n(t) = \mathrm{P}(T_n \le t)$ be the distribution function of $T_n$, and similarly let
$G_n^*(t) = \mathrm{P}(T_n^* \le t)$ be the distribution function of the bootstrap statistic
$T_n^*$. Conclude from part a that the maximum distance between $G_n^*$ and
$G_n$ can be bounded from below as follows:

\[ \sup_{t \in \mathbb{R}} |G_n^*(t) - G_n(t)| \ge 1 - \left(1 - \frac{1}{n}\right)^n. \]

c. Use part b to argue that for all $n$, the maximum distance between $G_n^*$
and $G_n$ is greater than 0.632:

\[ \sup_{t \in \mathbb{R}} |G_n^*(t) - G_n(t)| \ge 1 - e^{-1} = 0.632. \]

Hint: you may use that $e^{-x} \ge 1 - x$ for all $x$.

We conclude that even for very large sample sizes the maximum distance
between the distribution functions of $T_n$ and its bootstrap counterpart $T_n^*$
is at least 0.632.
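
The lower bound in this exercise can be illustrated numerically. The Python
sketch below is not a solution to the exercise: it evaluates $1 - (1 - 1/n)^n$
for a few sample sizes and, for one hypothetical dataset with distinct values,
estimates by simulation how often a bootstrap sample contains the dataset
maximum. The function names and the dataset are assumptions for illustration.

```python
import math
import random

def exact_probability(n):
    """Value of 1 - (1 - 1/n)^n, the lower bound from part b."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_probability(n, repetitions=5000, seed=0):
    """Relative frequency with which the bootstrap maximum equals m_n."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]   # hypothetical U(0, 1) dataset
    m_n = max(data)
    hits = 0
    for _ in range(repetitions):
        boot = rng.choices(data, k=n)         # empirical bootstrap sample
        if max(boot) == m_n:
            hits += 1
    return hits / repetitions

for size in (2, 10, 100, 10000):
    print(size, exact_probability(size))
print(simulated_probability(50), exact_probability(50), 1.0 - math.exp(-1.0))
```

The printed bound decreases toward $1 - e^{-1}$ as $n$ grows but never drops below
it, which is the content of part c.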


18.13 (Exercise 18.12 continued). In contrast to the empirical bootstrap, the
parametric bootstrap for $T_n$ does work. Suppose we estimate the parameter $\theta$
of the $U(0, \theta)$ distribution by

\[ \hat{\theta} = \frac{n+1}{n} m_n, \quad \text{where } m_n = \text{maximum of } x_1, x_2, \ldots, x_n. \]

Let now $X_1^*, X_2^*, \ldots, X_n^*$ be a bootstrap random sample from a $U(0, \hat{\theta})$
distribution, and let $M_n^*$ be the corresponding bootstrap maximum. Again, we
are going to compare the distribution function $G_n$ of $T_n = 1 - M_n/\theta$ with the
distribution function $G_n^*$ of its bootstrap counterpart $T_n^* = 1 - M_n^*/\hat{\theta}$.

a. Check that the distribution function $F_\theta$ of a $U(0, \theta)$ distribution is given
by

\[ F_\theta(a) = \frac{a}{\theta} \quad \text{for } 0 \le a \le \theta. \]

b. Check that the distribution function of $T_n$ is

\[ G_n(t) = \mathrm{P}(T_n \le t) = 1 - (1 - t)^n \quad \text{for } 0 \le t \le 1. \]

Hint: rewrite $\mathrm{P}(T_n \le t)$ as $1 - \mathrm{P}(M_n < \theta(1 - t))$ and use the rule on
page 109 about the distribution function of the maximum.

c. Show that $T_n^*$ has the same distribution function:

\[ G_n^*(t) = \mathrm{P}(T_n^* \le t) = 1 - (1 - t)^n \quad \text{for } 0 \le t \le 1. \]

This means that, in contrast to the empirical bootstrap (see Exercise 18.12),
the parametric bootstrap works perfectly in this situation.


Unbiased estimators


In  Chapter  17  we  saw  that  a  dataset  can  be  modeled  as  a  realization  of  a 
random  sample  from  a  probability  distribution  and  that  quantities  of  interest 
correspond  to  features  of  the  model  distribution.  One  of  our  tasks  is  to  use  the 
dataset to estimate a quantity of interest. We shall mainly deal with the
situation where it is modeled as one of the parameters of the model distribution
or  as  a  certain  function  of  the  parameters.  We  will  first  discuss  what  we  mean 
exactly  by  an  estimator  and  then  introduce  the  notion  of  unbiasedness  as  a 
desirable  property  for  estimators.  We  end  the  chapter  by  providing  unbiased 
estimators  for  the  expectation  and  variance  of  a  model  distribution. 


19.1  Estimators 

Consider the arrivals of packages at a network server. One is interested in the
intensity at which packages arrive on a generic day and in the percentage of
minutes during which no packages arrive. If the arrivals occur completely at
random in time, the arrival process can be modeled by a Poisson process. This
would mean that the number of arrivals during one minute is modeled by a
random variable having a Poisson distribution with (unknown) parameter $\mu$.
The intensity of the arrivals is then modeled by the parameter $\mu$ itself, and
the percentage of minutes during which no packages arrive is modeled by the
probability of zero arrivals: $e^{-\mu}$. Suppose one observes the arrival process for a
while and gathers a dataset $x_1, x_2, \ldots, x_n$, where $x_i$ represents the number of
arrivals in the $i$th minute. Our task will be to estimate, based on the dataset,
the parameter $\mu$ and a function of the parameter: $e^{-\mu}$.

This example is typical for the general situation in which our dataset is modeled
as a realization of a random sample $X_1, X_2, \ldots, X_n$ from a probability
distribution that is completely determined by one or more parameters. The
parameters that determine the model distribution are called the model
parameters. We focus on the situation where the quantity of interest corresponds
to a feature of the model distribution that can be described by the model
parameters themselves or by some function of the model parameters. This
distribution feature is referred to as the parameter of interest. In discussing
this general setup we shall denote the parameter of interest by the Greek
letter $\theta$. So, for instance, in our network server example, $\mu$ is the model
parameter. When we are interested in the arrival intensity, the role of $\theta$ is played
by the parameter $\mu$ itself, and when we are interested in the percentage of
idle minutes the role of $\theta$ is played by $e^{-\mu}$.

Whatever method we use to estimate the parameter of interest $\theta$, the result
depends only on our dataset.


Estimate. An estimate is a value $t$ that only depends on the dataset
$x_1, x_2, \ldots, x_n$, i.e., $t$ is some function of the dataset only:

\[ t = h(x_1, x_2, \ldots, x_n). \]


This description of estimate is a bit formal. The idea is, of course, that the
value $t$, computed from our dataset $x_1, x_2, \ldots, x_n$, gives some indication of
the “true” value of the parameter $\theta$. We have already met several estimates in
Chapter 17; see, for instance, Table 17.2. This table illustrates that the value
of an estimate can be anything: a single number, a vector of numbers, even a
complete curve.

Let us return to our network server example in which our dataset $x_1, x_2, \ldots, x_n$
is modeled as a realization of a random sample from a $Pois(\mu)$ distribution.
The intensity at which packages arrive is then represented by the parameter $\mu$.
Since the parameter $\mu$ is the expectation of the model distribution, the law
of large numbers suggests the sample mean $\bar{x}_n$ as a natural estimate for $\mu$.
On the other hand, the parameter $\mu$ also represents the variance of the model
distribution, so that by a similar reasoning another natural estimate is the
sample variance $s_n^2$.

The percentage of idle minutes is modeled by the probability of zero arrivals.
Similar to the reasoning in Section 13.4, a natural estimate is the relative
frequency of zeros in the dataset:

\[ \frac{\text{number of } x_i \text{ equal to zero}}{n}. \]

On the other hand, the probability of zero arrivals can be expressed as a
function of the model parameter: $e^{-\mu}$. Hence, if we estimate $\mu$ by $\bar{x}_n$, we
could also estimate $e^{-\mu}$ by $e^{-\bar{x}_n}$.
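
To make the four estimates discussed above concrete, the following Python
sketch computes them for a small made-up dataset of arrival counts. The
numbers are hypothetical, and the sample variance is taken to be the version
that divides by $n - 1$, which is what `statistics.variance` computes.

```python
import math
from statistics import fmean, variance

# Hypothetical arrival counts for ten one-minute intervals.
counts = [2, 0, 3, 1, 4, 2, 0, 1, 3, 2]

mean_estimate = fmean(counts)               # sample mean, estimates mu
variance_estimate = variance(counts)        # sample variance, also estimates mu
zero_frequency = counts.count(0) / len(counts)   # relative frequency of zeros
plug_in_estimate = math.exp(-mean_estimate)      # exp(-sample mean)

print(mean_estimate, variance_estimate)
print(zero_frequency, plug_in_estimate)
```

The two estimates for $\mu$ differ, and so do the two estimates for the
probability of zero arrivals, which illustrates that several estimates may
compete for the same parameter of interest.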

Quick exercise 19.1 Suppose we estimate the probability of zero arrivals
$e^{-\mu}$ by the relative frequency of $x_i$ equal to zero. Deduce an estimate for $\mu$
from this.

The preceding examples illustrate that one can often think of several estimates
for the parameter of interest. This raises questions like

•  When  is  one  estimate  better  than  another? 

•  Does  there  exist  a  best  possible  estimate? 

For instance, can we say which of the values $\bar{x}_n$ or $s_n^2$ computed from the
dataset is closer to the “true” parameter $\mu$? The answer is no. The measurements
and the corresponding estimates are subject to randomness, so that
we cannot say anything with certainty about which of the two is closer to $\mu$.
One of the things we can say for each of them is how likely it is that they are
within a given distance from $\mu$. To this end, we consider the random variables
that correspond to the estimates. Because our dataset $x_1, x_2, \ldots, x_n$ is modeled
as a realization of a random sample $X_1, X_2, \ldots, X_n$, the estimate $t$ is a
realization of a random variable $T$.


Estimator. Let $t = h(x_1, x_2, \ldots, x_n)$ be an estimate based on the
dataset $x_1, x_2, \ldots, x_n$. Then $t$ is a realization of the random variable

\[ T = h(X_1, X_2, \ldots, X_n). \]

The random variable $T$ is called an estimator.


The  word  estimator  refers  to  the  method  or  device  for  estimation.  This  is 
distinguished  from  estimate,  which  refers  to  the  actual  value  computed  from 
a  dataset.  Note  that  estimators  are  special  cases  of  sample  statistics.  In  the 
remainder  of  this  chapter  we  will  discuss  the  notion  of  unbiasedness  that 
describes  to  some  extent  the  behavior  of  estimators. 


19.2  Investigating  the  behavior  of  an  estimator 

Let us continue with our network server example. Suppose we have observed
the network for 30 minutes and we have recorded the number of arrivals in
each minute. The dataset is modeled as a realization of a random sample
$X_1, X_2, \ldots, X_n$ of size $n = 30$ from a $Pois(\mu)$ distribution. Let us concentrate
on estimating the probability $p_0$ of zero arrivals, which is an unknown number
between 0 and 1. As motivated in the previous section, we have the following
possible estimators:

\[ S = \frac{\text{number of } X_i \text{ equal to zero}}{n} \quad \text{and} \quad T = e^{-\bar{X}_n}. \]

Our first estimator $S$ can only attain the values $0, \frac{1}{30}, \frac{2}{30}, \ldots, 1$, so that in
general it cannot give the exact value of $p_0$. Similarly for our second estimator
$T$, which can only attain the values $1, e^{-1/30}, e^{-2/30}, \ldots$. So clearly, we
cannot expect our estimators always to give the exact value of $p_0$ on the basis of
30 observations. Well, then what can we expect from a reasonable estimator?

To get an idea of the behavior of both estimators, we pretend we know $\mu$,
and we simulate the estimation process in the case of $n = 30$ observations.
Let us choose $\mu = \ln 10$, so that $p_0 = e^{-\mu} = 0.1$. We draw 30 values from
a Poisson distribution with parameter $\mu = \ln 10$ and compute the value of
estimators $S$ and $T$. We repeat this 500 times, so that we have 500 values
for each estimator. In Figure 19.1 a frequency histogram¹ of these values
for estimator $S$ is displayed on the left and for estimator $T$ on the right.
Clearly, the values of both estimators vary around the value 0.1, which they
are supposed to estimate.
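
The simulation just described can be sketched as follows. Python's standard
library has no Poisson generator, so the sketch uses Knuth's multiplication
method as a stand-in; that choice, the function names, and the random seed are
assumptions, and the resulting values will not match those behind Figure 19.1.

```python
import math
import random
from statistics import fmean

def poisson_draw(mu, rng):
    """Draw one Poisson(mu) value with Knuth's multiplication method."""
    limit = math.exp(-mu)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def simulate_estimators(mu, n=30, repetitions=500, seed=0):
    """Return lists of simulated values of the estimators S and T."""
    rng = random.Random(seed)
    s_values, t_values = [], []
    for _ in range(repetitions):
        sample = [poisson_draw(mu, rng) for _ in range(n)]
        s_values.append(sample.count(0) / n)        # relative frequency of zeros
        t_values.append(math.exp(-fmean(sample)))   # exp(-sample mean)
    return s_values, t_values

s_values, t_values = simulate_estimators(math.log(10))
print(fmean(s_values), fmean(t_values))
```

Both averages should land near 0.1, in line with the observation that the
simulated values of $S$ and $T$ vary around the quantity they estimate.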


Fig. 19.1. Frequency histograms of 500 values for estimators $S$ (left) and $T$ (right)
of $p_0 = 0.1$.


19.3  The  sampling  distribution  and  unbiasedness 

We have just seen that the values generated for estimator $S$ fluctuate around
$p_0 = 0.1$. Although the value of this estimator is not always equal to 0.1, it
is desirable that on average, $S$ is on target, i.e., $\mathrm{E}[S] = 0.1$. Moreover, it is
desirable that this property holds no matter what the actual value of $p_0$ is,
i.e.,

\[ \mathrm{E}[S] = p_0 \]

irrespective of the value $0 < p_0 < 1$. In order to find out whether this is
true, we need the probability distribution of the estimator $S$. Of course this
is simply the distribution of a random variable, but because estimators are
constructed from a random sample $X_1, X_2, \ldots, X_n$, we speak of the sampling
distribution.

¹ In a frequency histogram the height of each vertical bar equals the frequency of
values in the corresponding bin.


The sampling distribution. Let $T = h(X_1, X_2, \ldots, X_n)$ be an
estimator based on a random sample $X_1, X_2, \ldots, X_n$. The probability
distribution of $T$ is called the sampling distribution of $T$.


The sampling distribution of $S$ can be found as follows. Write

\[ S = \frac{Y}{n}, \]

where $Y$ is the number of $X_i$ equal to zero. If for each $i$ we label $X_i = 0$ as
a success, then $Y$ is equal to the number of successes in $n$ independent trials
with $p_0$ as the probability of success. Similar to Section 4.3, it follows that $Y$
has a $Bin(n, p_0)$ distribution. Hence the sampling distribution of $S$ is that of
a $Bin(n, p_0)$ distributed random variable divided by $n$. This means that $S$ is
a discrete random variable that attains the values $k/n$, where $k = 0, 1, \ldots, n$,
with probabilities given by

\[ p_S\!\left(\frac{k}{n}\right) = \mathrm{P}\!\left(S = \frac{k}{n}\right) = \mathrm{P}(Y = k) = \binom{n}{k} p_0^k (1 - p_0)^{n-k}. \]

The probability mass function of $S$ for the case $n = 30$ and $p_0 = 0.1$ is
displayed in Figure 19.2. Since $S = Y/n$ and $Y$ has a $Bin(n, p_0)$ distribution,
it follows that

\[ \mathrm{E}[S] = \frac{\mathrm{E}[Y]}{n} = \frac{n p_0}{n} = p_0. \]

So, indeed, the estimator $S$ for $p_0$ has the property $\mathrm{E}[S] = p_0$. This property
reflects the fact that estimator $S$ has no systematic tendency to produce
estimates that are larger than $p_0$, and no systematic tendency to produce
estimates that are smaller than $p_0$. This is a desirable property for estimators,
and estimators that have this property are called unbiased.

Fig. 19.2. Probability mass function of $S$.


Definition. An estimator $T$ is called an unbiased estimator for the
parameter $\theta$, if

\[ \mathrm{E}[T] = \theta \]

irrespective of the value of $\theta$. The difference $\mathrm{E}[T] - \theta$ is called the
bias of $T$; if this difference is nonzero, then $T$ is called biased.


Let us return to our second estimator for the probability of zero arrivals in
the network server example: $T = e^{-\bar{X}_n}$. The sampling distribution can be
obtained as follows. Write

\[ T = e^{-Z/n}, \]

where $Z = X_1 + X_2 + \cdots + X_n$. From Exercise 12.9 we know that the random
variable $Z$, being the sum of $n$ independent $Pois(\mu)$ random variables, has
a $Pois(n\mu)$ distribution. This means that $T$ is a discrete random variable
attaining values $e^{-k/n}$, where $k = 0, 1, \ldots$, and the probability mass function
of $T$ is given by

\[ p_T\!\left(e^{-k/n}\right) = \mathrm{P}\!\left(T = e^{-k/n}\right) = \mathrm{P}(Z = k) = \frac{e^{-n\mu}(n\mu)^k}{k!}. \]

The probability mass function of $T$ for the case $n = 30$ and $p_0 = 0.1$ is
displayed in Figure 19.3. From the histogram in Figure 19.1 as well as from
the probability mass function in Figure 19.3, you may get the impression
that $T$ is also an unbiased estimator. However, this is not the case, which follows
immediately from an application of Jensen's inequality:

\[ \mathrm{E}[T] = \mathrm{E}\!\left[e^{-\bar{X}_n}\right] > e^{-\mathrm{E}[\bar{X}_n]}, \]

where we have a strict inequality because the function $g(x) = e^{-x}$ is strictly
convex ($g''(x) = e^{-x} > 0$). Recall that the parameter $\mu$ equals the expectation
of the $Pois(\mu)$ model distribution, so that according to Section 13.1 we have
$\mathrm{E}[\bar{X}_n] = \mu$. We find that

\[ \mathrm{E}[T] > e^{-\mu} = p_0, \]

which means that the estimator $T$ for $p_0$ has positive bias. In fact we can
compute $\mathrm{E}[T]$ exactly (see Exercise 19.9):

\[ \mathrm{E}[T] = \mathrm{E}\!\left[e^{-\bar{X}_n}\right] = e^{-n\mu(1 - e^{-1/n})}. \]

Note that $n(1 - e^{-1/n}) \to 1$, so that

\[ \mathrm{E}[T] = e^{-n\mu(1 - e^{-1/n})} \to e^{-\mu} = p_0 \]

as $n$ goes to infinity. Hence, although $T$ has positive bias, the bias decreases
to zero as the sample size becomes larger. In Figure 19.4 the expectation of
$T$ is displayed as a function of the sample size $n$ for the case $\mu = \ln(10)$. For
$n = 30$ the difference between $\mathrm{E}[T]$ and $p_0 = 0.1$ equals 0.0038.

Fig. 19.3. Probability mass function of $T$.

Fig. 19.4. $\mathrm{E}[T]$ as a function of $n$.


Quick exercise 19.2 If we estimate $p_0 = e^{-\mu}$ by the relative frequency of
zeros $S = Y/n$, then we could estimate $\mu$ by $U = -\ln(S)$. Argue that $U$ is a
biased estimator for $\mu$. Is the bias positive or negative?
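Returning to the estimator $T = e^{-\bar{X}_n}$ of this section: the exact expression $\mathrm{E}[T] = e^{-n\mu(1 - e^{-1/n})}$ is easy to evaluate numerically. The following is a minimal sketch, not part of the original text; the function name `expected_T` and the chosen sample sizes are illustrative assumptions. It tabulates the bias $\mathrm{E}[T] - p_0$ for $\mu = \ln(10)$, so that $p_0 = 0.1$.

```python
import math

def expected_T(mu: float, n: int) -> float:
    """Exact E[T] for T = exp(-sample mean) of n independent Pois(mu) variables."""
    return math.exp(-n * mu * (1.0 - math.exp(-1.0 / n)))

mu = math.log(10)      # chosen so that p0 = exp(-mu) = 0.1
p0 = math.exp(-mu)

# The bias is positive for every n and shrinks as n grows.
for n in (5, 30, 100, 1000):
    e_t = expected_T(mu, n)
    print(f"n = {n:4d}  E[T] = {e_t:.4f}  bias = {e_t - p0:.4f}")
```

For $n = 30$ the printed bias should agree with the value of about 0.0038 quoted in the text.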

We conclude this section by returning to the estimation of the parameter $\mu$.
Apart from the (biased) estimator in Quick exercise 19.2 we also considered
the sample mean $\bar{X}_n$ and sample variance $S_n^2$ as possible estimators for $\mu$.
These are both unbiased estimators for the parameter $\mu$. This is a direct
consequence of a more general property of $\bar{X}_n$ and $S_n^2$, which is discussed in
the next section.


19.4  Unbiased  estimators  for  expectation  and  variance 

Sometimes the quantity of interest can be described by the expectation or
variance of the model distribution, and it is irrelevant whether this distribution
is of a parametric type. In this section we propose unbiased estimators for
these distribution features.

Unbiased estimators for expectation and variance. Suppose
$X_1, X_2, \ldots, X_n$ is a random sample from a distribution with
finite expectation $\mu$ and finite variance $\sigma^2$. Then

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

is an unbiased estimator for $\mu$ and

\[ S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \]

is an unbiased estimator for $\sigma^2$.

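The first statement says that $\mathrm{E}[\bar{X}_n] = \mu$, which was shown in Section 13.1.
The second statement says $\mathrm{E}[S_n^2] = \sigma^2$. To see this, use linearity of
expectations to write

\[ \mathrm{E}[S_n^2] = \frac{1}{n-1} \sum_{i=1}^{n} \mathrm{E}\!\left[(X_i - \bar{X}_n)^2\right]. \]

Since $\mathrm{E}[\bar{X}_n] = \mu$, we have $\mathrm{E}[X_i - \bar{X}_n] = \mathrm{E}[X_i] - \mathrm{E}[\bar{X}_n] = 0$. Now note that
for any random variable $Y$ with $\mathrm{E}[Y] = 0$, we have

\[ \mathrm{Var}(Y) = \mathrm{E}[Y^2] - (\mathrm{E}[Y])^2 = \mathrm{E}[Y^2]. \]

Applying this to $Y = X_i - \bar{X}_n$, it follows that

\[ \mathrm{E}\!\left[(X_i - \bar{X}_n)^2\right] = \mathrm{Var}(X_i - \bar{X}_n). \]

Note that we can write

\[ X_i - \bar{X}_n = \frac{n-1}{n} X_i - \frac{1}{n} \sum_{j \neq i} X_j. \]

Then from the rules concerning variances of sums of independent random
variables we find that

\[
\begin{aligned}
\mathrm{Var}(X_i - \bar{X}_n)
  &= \mathrm{Var}\!\left(\frac{n-1}{n} X_i - \frac{1}{n} \sum_{j \neq i} X_j\right) \\
  &= \frac{(n-1)^2}{n^2}\,\mathrm{Var}(X_i) + \frac{1}{n^2} \sum_{j \neq i} \mathrm{Var}(X_j) \\
  &= \left[\frac{(n-1)^2}{n^2} + \frac{n-1}{n^2}\right] \sigma^2 = \frac{n-1}{n}\,\sigma^2.
\end{aligned}
\]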

We conclude that

\[ \mathrm{E}[S_n^2] = \frac{1}{n-1} \sum_{i=1}^{n} \mathrm{Var}(X_i - \bar{X}_n) = \frac{1}{n-1} \cdot n \cdot \frac{n-1}{n}\,\sigma^2 = \sigma^2. \]

This explains why we divide by $n - 1$ in the formula for $S_n^2$: only in this case
is $S_n^2$ an unbiased estimator for the "true" variance $\sigma^2$. If we were to divide by
$n$ instead of $n - 1$, we would obtain an estimator with negative bias; it would
systematically produce too-small estimates for $\sigma^2$.
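The effect of the divisor can be seen in a small simulation. The sketch below is an illustration added here, not part of the original text; the normal model, the values $n = 5$ and $\sigma = 2$, the seed, and the function name `average_variance_estimates` are all assumptions made for the example. It averages both versions of the estimator over many samples.

```python
import random

def average_variance_estimates(n, sigma, reps, seed=0):
    """Average the divide-by-(n-1) and divide-by-n variance estimates over reps samples."""
    rng = random.Random(seed)
    total_unbiased = 0.0
    total_biased = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        sum_sq = sum((x - mean) ** 2 for x in sample)
        total_unbiased += sum_sq / (n - 1)   # S_n^2
        total_biased += sum_sq / n           # same sum of squares, divided by n
    return total_unbiased / reps, total_biased / reps

mean_s2, mean_v2 = average_variance_estimates(n=5, sigma=2.0, reps=20000)
print(f"true variance:            {2.0 ** 2:.3f}")
print(f"average with divisor n-1: {mean_s2:.3f}")
print(f"average with divisor n:   {mean_v2:.3f}")
```

With these settings the first average should land close to $\sigma^2 = 4$, while the second should be noticeably smaller.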

Quick exercise 19.3 Consider the following estimator for $\sigma^2$:

\[ V_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2. \]

Compute the bias $\mathrm{E}[V_n^2] - \sigma^2$ for this estimator, where you can keep
computations simple by realizing that $V_n^2 = (n-1) S_n^2 / n$.


Unbiasedness  does  not  always  carry  over 

We have seen that $S_n^2$ is an unbiased estimator for the "true" variance $\sigma^2$. A
natural question is whether $S_n$ is again an unbiased estimator for $\sigma$. This is not
the case. Since the function $g(x) = x^2$ is strictly convex, Jensen's inequality
yields that

\[ \sigma^2 = \mathrm{E}[S_n^2] > (\mathrm{E}[S_n])^2, \]

which implies that $\mathrm{E}[S_n] < \sigma$. Another example is the network arrivals, in
which $\bar{X}_n$ is an unbiased estimator for $\mu$, whereas $e^{-\bar{X}_n}$ is positively biased
with respect to $e^{-\mu}$. These examples illustrate a general fact: unbiasedness
does not always carry over, i.e., if $T$ is an unbiased estimator for a parameter $\theta$,
then $g(T)$ does not have to be an unbiased estimator for $g(\theta)$.

However, there is one special case in which unbiasedness does carry over,
namely if $g(T) = aT + b$. Indeed, if $T$ is unbiased for $\theta$, that is, $\mathrm{E}[T] = \theta$, then by the
change-of-units rule for expectations,

\[ \mathrm{E}[aT + b] = a\,\mathrm{E}[T] + b = a\theta + b, \]

which means that $aT + b$ is unbiased for $a\theta + b$.


19.5  Solutions  to  the  quick  exercises 


19.1 Write $Y$ for the number of $X_i$ equal to zero. Denote the probability of
zero by $p_0$, so that $p_0 = e^{-\mu}$. This means that $\mu = -\ln(p_0)$. Hence if we
estimate $p_0$ by the relative frequency $Y/n$, we can estimate $\mu$ by $-\ln(Y/n)$.


19.2 The function $g(x) = -\ln(x)$ is strictly convex, since $g''(x) = 1/x^2 > 0$.
Hence by Jensen's inequality

\[ \mathrm{E}[U] = \mathrm{E}[-\ln(S)] > -\ln(\mathrm{E}[S]). \]

Since we have seen that $\mathrm{E}[S] = p_0 = e^{-\mu}$, it follows that $\mathrm{E}[U] > -\ln(\mathrm{E}[S]) =
-\ln(e^{-\mu}) = \mu$. This means that $U$ has positive bias.


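19.3 Using that $\mathrm{E}[S_n^2] = \sigma^2$, we find that

\[ \mathrm{E}[V_n^2] = \mathrm{E}\!\left[\frac{n-1}{n} S_n^2\right] = \frac{n-1}{n}\,\mathrm{E}[S_n^2] = \frac{n-1}{n}\,\sigma^2. \]

We conclude that the bias of $V_n^2$ equals $\mathrm{E}[V_n^2] - \sigma^2 = -\sigma^2/n < 0$.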


19.6  Exercises 

19.1 ⊞ Suppose our dataset is a realization of a random sample $X_1, X_2, \ldots, X_n$
from a uniform distribution on the interval $[-\theta, \theta]$, where $\theta$ is unknown.

a. Show that

\[ T = \frac{3}{n}\left(X_1^2 + X_2^2 + \cdots + X_n^2\right) \]

is an unbiased estimator for $\theta^2$.

b. Is $\sqrt{T}$ also an unbiased estimator for $\theta$? If not, argue whether it has
positive or negative bias.

19.2 Suppose the random variables $X_1, X_2, \ldots, X_n$ have the same expectation $\mu$.
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a.  Is  S'  =  ^Xi  +  ^X2  +  ^Xs  an  unbiased  estimator  for  fil 

b. Under what conditions on constants $a_1, a_2, \ldots, a_n$ is

\[ T = a_1 X_1 + a_2 X_2 + \cdots + a_n X_n \]

an unbiased estimator for $\mu$?

19.3 □ Suppose the random variables $X_1, X_2, \ldots, X_n$ have the same
expectation $\mu$. For which constants $a$ and $b$ is

\[ T = a(X_1 + X_2 + \cdots + X_n) + b \]

an unbiased estimator for $\mu$?

19.4 Recall Exercise 17.5 about the number of cycles to pregnancy. Suppose
the dataset corresponding to the table in Exercise 17.5 a is modeled as a
realization of a random sample $X_1, X_2, \ldots, X_n$ from a $Geo(p)$ distribution,
where $0 < p < 1$ is unknown. Motivated by the law of large numbers, a
natural estimator for $p$ is

\[ T = 1/\bar{X}_n. \]

a. Check that $T$ is a biased estimator for $p$ and find out whether it has
positive or negative bias.

b. In Exercise 17.5 we discussed the estimation of the probability that a
woman becomes pregnant within three or fewer cycles. One possible
estimator for this probability is the relative frequency of women that became
pregnant within three cycles

\[ S = \frac{\text{number of } X_i \leq 3}{n}. \]

Show that $S$ is an unbiased estimator for this probability.

19.5 □ Suppose a dataset is modeled as a realization of a random sample
$X_1, X_2, \ldots, X_n$ from an $Exp(\lambda)$ distribution, where $\lambda > 0$ is unknown. Let
$\mu$ denote the corresponding expectation and let $M_n$ denote the minimum of
$X_1, X_2, \ldots, X_n$. Recall from Exercise 8.18 that $M_n$ has an $Exp(n\lambda)$
distribution. Find out for which constant $c$ the estimator

\[ T = c M_n \]

is an unbiased estimator for $\mu$.

19.6 □ Consider the following dataset of lifetimes of ball bearings in hours.

 6278   3113   5236  11584  12628
 7725   8604  14266   6125   9350
 3212   9003   3523  12888   9460
13431  17809   2812  11825   2398

Source: J.E. Angus. Goodness-of-fit tests for exponentiality based on a
loss-of-memory type functional equation. Journal of Statistical Planning and
Inference, 6:241-251, 1982; example 5 on page 249.

One is interested in estimating the minimum lifetime of this type of ball
bearing. The dataset is modeled as a realization of a random sample $X_1, \ldots, X_n$.
Each random variable $X_i$ is represented as

\[ X_i = \delta + Y_i, \]

where $Y_i$ has an $Exp(\lambda)$ distribution and $\delta > 0$ is an unknown parameter that
is supposed to model the minimum lifetime. The objective is to construct an
unbiased estimator for $\delta$. It is known that

\[ \mathrm{E}[M_n] = \delta + \frac{1}{n\lambda} \quad\text{and}\quad \mathrm{E}[\bar{X}_n] = \delta + \frac{1}{\lambda}, \]

where $M_n$ = minimum of $X_1, X_2, \ldots, X_n$ and $\bar{X}_n = (X_1 + X_2 + \cdots + X_n)/n$.

a. Check that

\[ T = \frac{n}{n-1}\left(\bar{X}_n - M_n\right) \]

is an unbiased estimator for $1/\lambda$.

b. Construct an unbiased estimator for $\delta$.

c. Use the dataset to compute an estimate for the minimum lifetime $\delta$. You
may use that the average lifetime of the data is 8563.5.

19.7 Leaves are divided into four different types: starchy-green, sugary-white,
starchy-white, and sugary-green. According to genetic theory, the types occur
with probabilities $\frac{1}{4}(\theta + 2)$, $\frac{1}{4}\theta$, $\frac{1}{4}(1 - \theta)$, and $\frac{1}{4}(1 - \theta)$, respectively, where
$0 < \theta < 1$. Suppose one has $n$ leaves. Then the number of starchy-green
leaves is modeled by a random variable $N_1$ with a $Bin(n, p_1)$ distribution,
where $p_1 = \frac{1}{4}(\theta + 2)$, and the number of sugary-white leaves is modeled by
a random variable $N_2$ with a $Bin(n, p_2)$ distribution, where $p_2 = \frac{1}{4}\theta$. The
following table lists the counts for the progeny of self-fertilized heterozygotes
among 3839 leaves.

Type            Count
Starchy-green    1997
Sugary-white       32
Starchy-white     906
Sugary-green      904

Source: R.A. Fisher. Statistical methods for research workers. Hafner, New
York, 1958; Table 62 on page 299.

Consider the following two estimators for $\theta$:

\[ T_1 = \frac{4}{n} N_1 - 2 \quad\text{and}\quad T_2 = \frac{4}{n} N_2. \]

a. Check that both $T_1$ and $T_2$ are unbiased estimators for $\theta$.

b. Compute the value of both estimators for $\theta$.

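19.8 ⊞ Recall the black cherry trees example from Exercise 17.9, modeled by
a linear regression model without intercept

\[ Y_i = \beta x_i + U_i \quad\text{for } i = 1, 2, \ldots, n, \]

where $U_1, U_2, \ldots, U_n$ are independent random variables with $\mathrm{E}[U_i] = 0$ and
$\mathrm{Var}(U_i) = \sigma^2$. We discussed three estimators for the parameter $\beta$:

\[
\begin{aligned}
B_1 &= \frac{1}{n}\left(\frac{Y_1}{x_1} + \cdots + \frac{Y_n}{x_n}\right), \\
B_2 &= \frac{Y_1 + \cdots + Y_n}{x_1 + \cdots + x_n}, \\
B_3 &= \frac{x_1 Y_1 + \cdots + x_n Y_n}{x_1^2 + \cdots + x_n^2}.
\end{aligned}
\]

Show that all three estimators are unbiased for $\beta$.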


19.9 Consider the network example where the dataset is modeled as a
realization of a random sample $X_1, X_2, \ldots, X_n$ from a $Pois(\mu)$ distribution. We
estimate the probability of zero arrivals $e^{-\mu}$ by means of $T = e^{-\bar{X}_n}$. Check
that

\[ \mathrm{E}[T] = e^{-n\mu(1 - e^{-1/n})}. \]

Hint: write $T = e^{-Z/n}$, where $Z = X_1 + X_2 + \cdots + X_n$ has a $Pois(n\mu)$
distribution.


20 


Efficiency  and  mean  squared  error 


In the previous chapter we introduced the notion of unbiasedness as a
desirable property of an estimator. If several unbiased estimators for the same
parameter of interest exist, we need a criterion for comparison of these
estimators. A natural criterion is some measure of spread of the estimators around
the parameter of interest. For unbiased estimators we will use variance. For
arbitrary estimators we introduce the notion of mean squared error (MSE),
which combines variance and bias.


20.1  Estimating  the  number  of  German  tanks 


In this section we come back to the problem of estimating German war
production as discussed in Section 1.5. We consider serial numbers on tanks, recoded
to numbers running from 1 to some unknown largest number $N$. Given is a
subset of $n$ numbers of this set. The objective is to estimate the total number
of tanks $N$ on the basis of the observed serial numbers.


Denote the observed distinct serial numbers by $x_1, x_2, \ldots, x_n$. This dataset
can be modeled as a realization of random variables $X_1, X_2, \ldots, X_n$
representing $n$ draws without replacement from the numbers $1, 2, \ldots, N$ with equal
probability. Note that in this example our dataset is not a realization of a
random sample, because the random variables $X_1, X_2, \ldots, X_n$ are dependent.
We propose two unbiased estimators. The first one is based on the sample
mean

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

and the second one is based on the sample maximum

\[ M_n = \max\{X_1, X_2, \ldots, X_n\}. \]

An estimator based on the sample mean

To construct an unbiased estimator for $N$ based on the sample mean, we start
by computing the expectation of $\bar{X}_n$. The linearity-of-expectations rule also
applies to dependent random variables, so that

\[ \mathrm{E}[\bar{X}_n] = \frac{\mathrm{E}[X_1] + \mathrm{E}[X_2] + \cdots + \mathrm{E}[X_n]}{n}. \]

In Section 9.3 we saw that the marginal distribution of each $X_i$ is the same:

\[ \mathrm{P}(X_i = k) = \frac{1}{N} \quad\text{for } k = 1, 2, \ldots, N. \]

Therefore the expectation of each $X_i$ is given by

\[ \mathrm{E}[X_i] = \frac{1 + 2 + \cdots + N}{N} = \frac{\frac{1}{2}N(N+1)}{N} = \frac{N+1}{2}. \]

It follows that

\[ \mathrm{E}[\bar{X}_n] = \frac{\mathrm{E}[X_1] + \mathrm{E}[X_2] + \cdots + \mathrm{E}[X_n]}{n} = \frac{N+1}{2}. \]

This directly implies that

\[ T_1 = 2\bar{X}_n - 1 \]

is an unbiased estimator for $N$, since the change-of-units rule yields that

\[ \mathrm{E}[T_1] = \mathrm{E}[2\bar{X}_n - 1] = 2\,\mathrm{E}[\bar{X}_n] - 1 = 2 \cdot \frac{N+1}{2} - 1 = N. \]


Quick  exercise  20.1  Suppose  we  have  observed  tanks  with  (recoded)  serial 
numbers 

61  19  56  24  16. 


Compute the value of the estimator $T_1$ for the total number of tanks.


An  estimator  based  on  the  sample  maximum 

To construct an unbiased estimator for $N$ based on the maximum, we first
compute the expectation of $M_n$. We start by computing the probability that
$M_n = k$, where $k$ takes the values $n, \ldots, N$. Similar to the combinatorics
used in Section 4.3 to derive the binomial distribution, the number of ways
to draw $n$ numbers without replacement from $1, 2, \ldots, N$ is $\binom{N}{n}$. Hence each
combination has probability $1/\binom{N}{n}$. In order to have $M_n = k$, we must have
one number equal to $k$ and choose the other $n - 1$ numbers out of the numbers
$1, 2, \ldots, k - 1$. There are $\binom{k-1}{n-1}$ ways to do this. Hence for the possible values
$k = n, n + 1, \ldots, N$,

\[
\mathrm{P}(M_n = k) = \frac{\binom{k-1}{n-1}}{\binom{N}{n}}
  = \frac{(k-1)!}{(k-n)!\,(n-1)!} \cdot \frac{(N-n)!\,n!}{N!}
  = n \cdot \frac{(k-1)!}{(k-n)!} \cdot \frac{(N-n)!}{N!}.
\]
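Thus the expectation of $M_n$ is given by

\[
\begin{aligned}
\mathrm{E}[M_n] = \sum_{k=n}^{N} k\,\mathrm{P}(M_n = k)
  &= \sum_{k=n}^{N} k \cdot n\,\frac{(k-1)!}{(k-n)!}\,\frac{(N-n)!}{N!} \\
  &= \sum_{k=n}^{N} n\,\frac{k!}{(k-n)!}\,\frac{(N-n)!}{N!} \\
  &= n \cdot \frac{(N-n)!}{N!} \sum_{k=n}^{N} \frac{k!}{(k-n)!}.
\end{aligned}
\]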

How to continue the computation of $\mathrm{E}[M_n]$? We use a trick: we start by
rearranging

\[ 1 = \sum_{j=n}^{N} \mathrm{P}(M_n = j) = \sum_{j=n}^{N} n\,\frac{(j-1)!}{(j-n)!}\,\frac{(N-n)!}{N!}, \]

finding that

\[ \sum_{j=n}^{N} \frac{(j-1)!}{(j-n)!} = \frac{N!}{n\,(N-n)!}. \qquad (20.1) \]

This holds for any $N$ and any $n \leq N$. In particular we could replace $N$ by
$N + 1$ and $n$ by $n + 1$:

\[ \sum_{j=n+1}^{N+1} \frac{(j-1)!}{(j-n-1)!} = \frac{(N+1)!}{(n+1)(N-n)!}. \]

Changing the summation variable to $k = j - 1$, we obtain

\[ \sum_{k=n}^{N} \frac{k!}{(k-n)!} = \frac{(N+1)!}{(n+1)(N-n)!}. \qquad (20.2) \]

This is exactly what we need to finish the computation of $\mathrm{E}[M_n]$. Substituting
(20.2) in what we obtained earlier, we find

\[
\begin{aligned}
\mathrm{E}[M_n] &= n \cdot \frac{(N-n)!}{N!} \sum_{k=n}^{N} \frac{k!}{(k-n)!} \\
  &= n \cdot \frac{(N-n)!}{N!} \cdot \frac{(N+1)!}{(n+1)(N-n)!} \\
  &= n \cdot \frac{N+1}{n+1}.
\end{aligned}
\]

Quick exercise 20.2 Choosing $n = N$ in this formula yields $\mathrm{E}[M_N] = N$.
Can you argue that this is the right answer without doing any computations?
With the formula for $\mathrm{E}[M_n]$ we can derive immediately that

\[ T_2 = \frac{n+1}{n}\,M_n - 1 \]

is an unbiased estimator for $N$, since by the change-of-units rule

\[ \mathrm{E}[T_2] = \mathrm{E}\!\left[\frac{n+1}{n}\,M_n - 1\right] = \frac{n+1}{n}\,\mathrm{E}[M_n] - 1 = \frac{n+1}{n} \cdot \frac{n(N+1)}{n+1} - 1 = N. \]


Quick exercise 20.3 Compute the value of estimator $T_2$ for the total number
of tanks on the basis of the observed numbers from Quick exercise 20.1.

20.2  Variance  of  an  estimator 

In the previous section we saw that we can construct two completely different
estimators for the total number of tanks $N$ that are both unbiased. The obvious
question is: which of the two is better? To answer this question, we investigate
how both estimators vary around the parameter of interest $N$. Although we
could in principle compute the distributions of $T_1$ and $T_2$, we carry out a
small simulation study instead. Take $N = 1000$ and $n = 10$ fixed. We draw
10 numbers, without replacement, from $1, 2, \ldots, 1000$ and compute the value
of the estimators $T_1$ and $T_2$. We repeat this two thousand times, so that we
have 2000 values for both estimators. In Figure 20.1 we have displayed the
histogram of the 2000 values for $T_1$ on the left and the histogram of the 2000
values for $T_2$ on the right. From the histograms, which reflect the probability
mass functions of both estimators, we see that the distributions of $T_1$ and
$T_2$ are of completely different types. As can be expected from the fact that
both estimators are unbiased, the values vary around the parameter of interest
$N = 1000$. The most important difference between the histograms is that the
variation in the values of $T_2$ is less than the variation in the values of $T_1$.
This suggests that estimator $T_2$ estimates the total number of tanks more
efficiently than estimator $T_1$, in the sense that it produces estimates that
are more concentrated around the parameter of interest $N$ than estimates
produced by $T_1$. Recall that the variance measures the spread of a random
variable. Hence the previous discussion motivates the use of the variance of
an estimator to evaluate its performance.

Fig. 20.1. Histograms of two thousand values for $T_1$ (left) and $T_2$ (right).
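A simulation study of this kind takes only a few lines. The following is a minimal sketch, assuming Python's standard `random` and `statistics` modules, an arbitrary seed, and the illustrative function name `simulate_tank_estimators`; it is not the code used to produce the figure.

```python
import random
import statistics

def simulate_tank_estimators(N, n, reps, seed=0):
    """Draw reps samples of size n without replacement from 1..N; return T1 and T2 values."""
    rng = random.Random(seed)
    t1_values, t2_values = [], []
    for _ in range(reps):
        sample = rng.sample(range(1, N + 1), n)          # draws without replacement
        t1_values.append(2 * sum(sample) / n - 1)        # T1 = 2 * sample mean - 1
        t2_values.append((n + 1) / n * max(sample) - 1)  # T2 = (n+1)/n * maximum - 1
    return t1_values, t2_values

t1, t2 = simulate_tank_estimators(N=1000, n=10, reps=2000)
print(f"T1: mean = {statistics.mean(t1):7.1f}, standard deviation = {statistics.stdev(t1):6.1f}")
print(f"T2: mean = {statistics.mean(t2):7.1f}, standard deviation = {statistics.stdev(t2):6.1f}")
```

Both averages should come out near $N = 1000$, with the values of $T_2$ showing clearly less spread than those of $T_1$.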


Efficiency. Let $T_1$ and $T_2$ be two unbiased estimators for the same
parameter $\theta$. Then estimator $T_2$ is called more efficient than
estimator $T_1$ if $\mathrm{Var}(T_2) < \mathrm{Var}(T_1)$, irrespective of the value of $\theta$.


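Let us compare $T_1$ and $T_2$ using this criterion. For $T_1$ we have

\[ \mathrm{Var}(T_1) = \mathrm{Var}(2\bar{X}_n - 1) = 4\,\mathrm{Var}(\bar{X}_n). \]

Although the $X_i$ are not independent, it is true that all pairs $(X_i, X_j)$ with
$i \neq j$ have the same distribution (this follows in the same way in which
we showed on page 122 that all $X_i$ have the same distribution). With the
variance-of-the-sum rule for $n$ random variables (see Exercise 10.17), we find
that

\[ \mathrm{Var}(X_1 + \cdots + X_n) = n\,\mathrm{Var}(X_1) + n(n-1)\,\mathrm{Cov}(X_1, X_2). \]

In Exercises 9.18 and 10.18, we computed that

\[ \mathrm{Var}(X_1) = \tfrac{1}{12}(N-1)(N+1), \qquad \mathrm{Cov}(X_1, X_2) = -\tfrac{1}{12}(N+1). \]

We find therefore that

\[
\begin{aligned}
\mathrm{Var}(T_1) &= 4\,\mathrm{Var}(\bar{X}_n) = \frac{4}{n^2}\,\mathrm{Var}(X_1 + \cdots + X_n) \\
  &= \frac{4}{n^2}\left[n \cdot \tfrac{1}{12}(N-1)(N+1) - n(n-1) \cdot \tfrac{1}{12}(N+1)\right] \\
  &= \frac{1}{3n}(N+1)\left[N - 1 - (n-1)\right] \\
  &= \frac{(N+1)(N-n)}{3n}.
\end{aligned}
\]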

Obtaining the variance of $T_2$ is a little more work. One can compute the
variance of $M_n$ in a way that is very similar to the way we obtained $\mathrm{E}[M_n]$.
The result is (see Remark 20.1 for details)

\[ \mathrm{Var}(M_n) = \frac{n(N+1)(N-n)}{(n+2)(n+1)^2}. \]

Remark 20.1 (How to compute this variance). The trick is to compute
not $\mathrm{E}[M_n^2]$ but $\mathrm{E}[M_n(M_n + 1)]$. First we derive an identity from
Equation (20.1) as before, this time replacing $N$ by $N + 2$ and $n$ by $n + 2$:

\[ \sum_{j=n+2}^{N+2} \frac{(j-1)!}{(j-n-2)!} = \frac{(N+2)!}{(n+2)(N-n)!}. \]

Changing the summation variable to $k = j - 2$ yields

\[ \sum_{k=n}^{N} \frac{(k+1)!}{(k-n)!} = \frac{(N+2)!}{(n+2)(N-n)!}. \]

With this formula one can obtain:

\[ \mathrm{E}[M_n(M_n + 1)] = \sum_{k=n}^{N} k(k+1) \cdot n\,\frac{(k-1)!}{(k-n)!}\,\frac{(N-n)!}{N!} = \frac{n(N+1)(N+2)}{n+2}. \]

Since we know $\mathrm{E}[M_n]$, we can determine $\mathrm{E}[M_n^2]$ from this, and subsequently
the variance of $M_n$.

With the expression for the variance of $M_n$, we derive

\[ \mathrm{Var}(T_2) = \mathrm{Var}\!\left(\frac{n+1}{n}\,M_n - 1\right) = \frac{(n+1)^2}{n^2}\,\mathrm{Var}(M_n) = \frac{(N+1)(N-n)}{n(n+2)}. \]

We see that $\mathrm{Var}(T_2) < \mathrm{Var}(T_1)$ for all $N$ and $n \geq 2$. Hence $T_2$ is always more
efficient than $T_1$, except when $n = 1$. In this case the variances are equal,
simply because the estimators are the same: they both equal $2X_1 - 1$.

The quotient $\mathrm{Var}(T_1)/\mathrm{Var}(T_2)$ is called the relative efficiency of $T_2$ with
respect to $T_1$. In our case the relative efficiency of $T_2$ with respect to $T_1$
equals

\[ \frac{\mathrm{Var}(T_1)}{\mathrm{Var}(T_2)} = \frac{(N+1)(N-n)}{3n} \cdot \frac{n(n+2)}{(N+1)(N-n)} = \frac{n+2}{3}. \]

Surprisingly, this quotient does not depend on $N$, and we see clearly the
advantage of $T_2$ over $T_1$ as the sample size $n$ gets larger.
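For small $N$ and $n$ the two variance formulas can be confirmed by brute force, since every subset of size $n$ is equally likely. The sketch below is an added illustration under that assumption; the function names `exact_variance` and `tank_variances` and the values $N = 12$, $n = 3$ are choices made for the example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

def exact_variance(values):
    """Variance of a list of equally likely values, in exact arithmetic."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def tank_variances(N: int, n: int):
    """Var(T1) and Var(T2), found by enumerating all samples of size n from 1..N."""
    samples = list(combinations(range(1, N + 1), n))
    t1 = [Fraction(2 * sum(s), n) - 1 for s in samples]        # T1 = 2 * mean - 1
    t2 = [Fraction((n + 1) * max(s), n) - 1 for s in samples]  # T2 = (n+1)/n * max - 1
    return exact_variance(t1), exact_variance(t2)

N, n = 12, 3
var_t1, var_t2 = tank_variances(N, n)
print("Var(T1):", var_t1, " formula:", Fraction((N + 1) * (N - n), 3 * n))
print("Var(T2):", var_t2, " formula:", Fraction((N + 1) * (N - n), n * (n + 2)))
print("ratio:  ", var_t1 / var_t2, " formula:", Fraction(n + 2, 3))
```

Each enumerated value should match the formula printed next to it.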

Quick exercise 20.4 Let $n = 5$, and let the sample be

7  3  10  45  15.

Compute the value of the estimator $T_1$ for $N$. Do you notice anything strange?

The self-contradictory behavior of $T_1$ in Quick exercise 20.4 is not rare: this
phenomenon will occur for up to 50% of the samples if $n$ and $N$ are large.
This gives another reason to prefer $T_2$ over $T_1$.

Remark 20.2 (The Cramér-Rao inequality). Suppose we have a random
sample from a continuous distribution with probability density function
$f_\theta$, where $\theta$ is the parameter of interest. Under certain smoothness
conditions on the density $f_\theta$, the variance of an unbiased estimator $T$ for $\theta$ always
has to be larger than or equal to a certain positive number, the so-called
Cramér-Rao lower bound:

\[ \mathrm{Var}(T) \geq \frac{1}{n\,\mathrm{E}\!\left[\left(\frac{\partial}{\partial\theta} \ln f_\theta(X)\right)^2\right]} \quad\text{for all } \theta. \]

Here $n$ is the size of the sample and $X$ a random variable whose density
function is $f_\theta$. In some cases we can find unbiased estimators attaining this
bound. These are called minimum variance unbiased estimators. An
example is the sample mean for the expectation of an exponential distribution.
(We will consider this case in Exercise 20.3.)


20.3  Mean  squared  error 

In the last section we compared two unbiased estimators by considering their
spread around the value to be estimated, where the spread was measured by
the variance. Although unbiasedness is a desirable property, the performance
of an estimator should mainly be judged by the way it spreads around the
parameter $\theta$ to be estimated. This leads to the following definition.


Definition. Let $T$ be an estimator for a parameter $\theta$. The mean
squared error of $T$ is the number $\mathrm{MSE}(T) = \mathrm{E}[(T - \theta)^2]$.


According to this criterion, an estimator $T_1$ performs better than an
estimator $T_2$ if $\mathrm{MSE}(T_1) < \mathrm{MSE}(T_2)$. Note that

\[
\begin{aligned}
\mathrm{MSE}(T) &= \mathrm{E}[(T - \theta)^2] \\
  &= \mathrm{E}[(T - \mathrm{E}[T] + \mathrm{E}[T] - \theta)^2] \\
  &= \mathrm{E}[(T - \mathrm{E}[T])^2] + 2\,\mathrm{E}[T - \mathrm{E}[T]]\,(\mathrm{E}[T] - \theta) + (\mathrm{E}[T] - \theta)^2 \\
  &= \mathrm{Var}(T) + (\mathrm{E}[T] - \theta)^2,
\end{aligned}
\]

where the middle term vanishes because $\mathrm{E}[T - \mathrm{E}[T]] = 0$.
So the MSE of $T$ turns out to be the variance of $T$ plus the square of the bias
of $T$. In particular, when $T$ is unbiased, the MSE of $T$ is just the variance
of $T$. This means that we already used mean squared errors to compare the
estimators $T_1$ and $T_2$ in the previous section. We extend the notion of efficiency
by saying that estimator $T_2$ is more efficient than estimator $T_1$ (for the same
parameter of interest), if the MSE of $T_2$ is smaller than the MSE of $T_1$.
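The decomposition into variance plus squared bias can be evaluated exactly for the two estimators of $p_0 = e^{-\mu}$ from Chapter 19, based on a sample of size $n$ from a $Pois(\mu)$ distribution: the relative frequency of zeros $S$ and $T = e^{-\bar{X}_n}$. The sketch below is an added illustration; it assumes the standard moment generating function of the Poisson distribution, $\mathrm{E}[e^{tZ}] = e^{n\mu(e^t - 1)}$ for $Z$ with a $Pois(n\mu)$ distribution, which is not derived in this text, and the function names `mse_S` and `mse_T` are hypothetical.

```python
import math

def mse_S(mu: float, n: int) -> float:
    """MSE of S = Y/n with Y ~ Bin(n, p0); S is unbiased, so MSE = Var(S)."""
    p0 = math.exp(-mu)
    return p0 * (1.0 - p0) / n

def mse_T(mu: float, n: int) -> float:
    """MSE of T = exp(-Z/n) with Z ~ Pois(n*mu), as variance plus squared bias."""
    p0 = math.exp(-mu)
    first_moment = math.exp(n * mu * (math.exp(-1.0 / n) - 1.0))   # E[T]
    second_moment = math.exp(n * mu * (math.exp(-2.0 / n) - 1.0))  # E[T^2]
    variance = second_moment - first_moment ** 2
    bias = first_moment - p0
    return variance + bias ** 2

n = 25
for mu in (0.5, 1.0, 2.0, 3.0):
    print(f"mu = {mu:3.1f}  MSE(S) = {mse_S(mu, n):.5f}  MSE(T) = {mse_T(mu, n):.5f}")
```

For the values of $\mu$ tried here, the biased estimator $T$ should show the smaller mean squared error.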

Unbiasedness  and  efficiency 

A biased estimator with a small variance may be more useful than an unbiased
estimator with a large variance. We illustrate this with the network server
example from Section 19.2. Recall that our goal was to estimate the probability
$p_0 = e^{-\mu}$ of zero arrivals (of packages) in a minute. We had two promising
candidates as estimators:

\[ S = \frac{\text{number of } X_i \text{ equal to zero}}{n} \quad\text{and}\quad T = e^{-\bar{X}_n}. \]

In Figure 20.2 we depict histograms of one thousand simulations of the values
of $S$ and $T$ computed for random samples of size $n = 25$ from a $Pois(\mu)$
distribution, where $\mu = 2$. Considering the way the values of the (biased!)
estimator $T$ are more concentrated around the true value $e^{-\mu} = e^{-2} = 0.1353$,
we would be inclined to prefer $T$ over $S$. This choice is strongly supported
by the fact that $T$ is more efficient than $S$: $\mathrm{MSE}(T)$ is always smaller than
$\mathrm{MSE}(S)$, as illustrated in Figure 20.3.

Fig. 20.2. Histograms of a thousand values for $S$ (left) and $T$ (right).

Fig. 20.3. MSEs of $S$ and $T$ as a function of $\mu$.


20.4 Solutions to the quick exercises

20.1 We have $\bar{x}_5 = (61 + 19 + 56 + 24 + 16)/5 = 176/5 = 35.2$. Therefore
$t_1 = 2 \cdot 35.2 - 1 = 69.4$.

20.2 When $n = N$, we have drawn all the numbers. But then the largest
number $M_N$ is $N$, and so $\mathrm{E}[M_N] = N$.

20.3 We have $t_2 = (6/5) \cdot 61 - 1 = 72.2$.

20.4 Since 45 is in the sample, $N$ has to be at least 45. Adding the numbers
yields $7 + 3 + 10 + 15 + 45 = 80$. So $t_1 = 2\bar{x}_n - 1 = 2 \cdot 16 - 1 = 31$. What is
strange about this is that the estimate for $N$ is far smaller than the number
45 in the sample!


20.5  Exercises 

20.1 Given is a random sample $X_1, X_2, \ldots, X_n$ from a distribution with finite
variance $\sigma^2$. We estimate the expectation of the distribution with the sample
mean $\bar{X}_n$. Argue that the larger our sample, the more efficient our estimator.
What is the relative efficiency $\mathrm{Var}(\bar{X}_n)/\mathrm{Var}(\bar{X}_{2n})$ of $\bar{X}_{2n}$ with respect to $\bar{X}_n$?

20.2 ⊞ Given are two estimators $S$ and $T$ for a parameter $\theta$. Furthermore it
is known that $\mathrm{Var}(S) = 40$ and $\mathrm{Var}(T) = 4$.

a. Suppose that we know that $\mathrm{E}[S] = \theta$ and $\mathrm{E}[T] = \theta + 3$. Which estimator
would you prefer, and why?

b. Suppose that we know that $\mathrm{E}[S] = \theta$ and $\mathrm{E}[T] = \theta + a$ for some positive
number $a$. For each $a$, which estimator would you prefer, and why?

20.3 ⊞ Suppose we have a random sample $X_1, \ldots, X_n$ from an $Exp(\lambda)$
distribution. Suppose we want to estimate the mean $1/\lambda$. According to Section 19.4
the estimator

\[ T_1 = \bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \]

is an unbiased estimator of $1/\lambda$. Let $M_n$ be the minimum of $X_1, X_2, \ldots, X_n$.
Recall from Exercise 8.18 that $M_n$ has an $Exp(n\lambda)$ distribution. In
Exercise 19.5 you have determined that

\[ T_2 = n M_n \]

is another unbiased estimator for $1/\lambda$. Which of the estimators $T_1$ and $T_2$
would you choose for estimating the mean $1/\lambda$? Substantiate your answer.

20.4 □ Consider the situation of this chapter, where we have to estimate the
parameter $N$ from a sample $X_1, \ldots, X_n$ drawn without replacement from the
numbers $\{1, \ldots, N\}$. To keep it simple, we consider $n = 2$. Let $M = M_2$ be
the maximum of $X_1$ and $X_2$. We have found that $T_2 = 3M/2 - 1$ is a good
unbiased estimator for $N$. We want to construct a new unbiased estimator
$T_3$ based on the minimum $L$ of $X_1$ and $X_2$. In the following you may use
that the random variable $L$ has the same distribution as the random variable
$N + 1 - M$ (this follows from symmetry considerations).

a. Show that $T_3 = 3L - 1$ is an unbiased estimator for $N$.

b. Compute $\mathrm{Var}(T_3)$ using that $\mathrm{Var}(M) = (N + 1)(N - 2)/18$. (The latter
has been computed in Remark 20.1.)

c. What is the relative efficiency of $T_2$ with respect to $T_3$?


20.5 Someone is proposing two unbiased estimators $U$ and $V$, with the same variance $\mathrm{Var}(U) = \mathrm{Var}(V)$. It therefore appears that we would not prefer one estimator over the other. However, we could go for a third estimator, namely $W = (U + V)/2$. Note that $W$ is unbiased. To judge the quality of $W$ we want to compute its variance. Lacking information on the joint probability distribution of $U$ and $V$, this is impossible. However, we should prefer $W$ in any case! To see this, show by means of the variance-of-the-sum rule that the relative efficiency of $U$ with respect to $W$ is equal to

$$\frac{\mathrm{Var}((U + V)/2)}{\mathrm{Var}(U)} = \frac{1}{2} + \frac{1}{2}\rho(U, V).$$

Here $\rho(U, V)$ is the correlation coefficient. Why does this result imply that we should use $W$ instead of $U$ (or $V$)?


20.6 A geodesic engineer measures the three unknown angles $\alpha_1$, $\alpha_2$, and $\alpha_3$ of a triangle. He models the uncertainty in the measurements by considering them as realizations of three independent random variables $T_1$, $T_2$, and $T_3$ with expectations

$$\mathrm{E}[T_1] = \alpha_1, \quad \mathrm{E}[T_2] = \alpha_2, \quad \mathrm{E}[T_3] = \alpha_3,$$

and all three with the same variance $\sigma^2$. In order to make use of the fact that the three angles must add to $\pi$, he also considers new estimators $U_1$, $U_2$, and $U_3$ defined by

$$U_1 = T_1 + \tfrac{1}{3}(\pi - T_1 - T_2 - T_3),$$
$$U_2 = T_2 + \tfrac{1}{3}(\pi - T_1 - T_2 - T_3),$$
$$U_3 = T_3 + \tfrac{1}{3}(\pi - T_1 - T_2 - T_3).$$

(Note that the "deviation" $\pi - T_1 - T_2 - T_3$ is evenly divided over the three measurements and that $U_1 + U_2 + U_3 = \pi$.)

a. Compute $\mathrm{E}[U_1]$ and $\mathrm{Var}(U_1)$.

b. What does he gain in efficiency when he uses $U_1$ instead of $T_1$ to estimate the angle $\alpha_1$?

c. What kind of estimator would you choose for $\alpha_1$ if it is known that the triangle is isosceles (i.e., $\alpha_1 = \alpha_2$)?

20.7 □ (Exercise 19.7 continued.) Leaves are divided into four different types: starchy-green, sugary-white, starchy-white, and sugary-green. According to genetic theory, the types occur with probabilities $\frac{1}{4}(\theta + 2)$, $\frac{1}{4}\theta$, $\frac{1}{4}(1 - \theta)$, and $\frac{1}{4}(1 - \theta)$, respectively, where $0 < \theta < 1$. Suppose one has $n$ leaves. Then the number of starchy-green leaves is modeled by a random variable $N_1$ with a $Bin(n, p_1)$ distribution, where $p_1 = \frac{1}{4}(\theta + 2)$, and the number of sugary-white leaves is modeled by a random variable $N_2$ with a $Bin(n, p_2)$ distribution, where $p_2 = \frac{1}{4}\theta$. Consider the following two estimators for $\theta$:

$$T_1 = \frac{4}{n}N_1 - 2 \quad \text{and} \quad T_2 = \frac{4}{n}N_2.$$

In Exercise 19.7 you showed that both $T_1$ and $T_2$ are unbiased estimators for $\theta$. Which estimator would you prefer? Motivate your answer.

20.8 ⊞ Let $\bar{X}_n$ and $\bar{Y}_m$ be the sample means of two independent random samples of size $n$ (resp. $m$) from the same distribution with mean $\mu$. We combine these two estimators to a new estimator $T$ by putting

$$T = r\bar{X}_n + (1 - r)\bar{Y}_m,$$

where $r$ is some number between 0 and 1.

a. Show that $T$ is an unbiased estimator for the mean $\mu$.

b. Show that $T$ is most efficient when $r = n/(n + m)$.

20.9 Given is a random sample $X_1, X_2, \ldots, X_n$ from a $Ber(p)$ distribution. One considers the estimators

$$T_1 = \frac{1}{n}(X_1 + \cdots + X_n) \quad \text{and} \quad T_2 = \min\{X_1, \ldots, X_n\}.$$

a. Are $T_1$ and $T_2$ unbiased estimators for $p$?

b. Show that

$$\mathrm{MSE}(T_1) = \frac{1}{n}p(1 - p), \quad \mathrm{MSE}(T_2) = p^n - 2p^{n+1} + p^2.$$

c. Which estimator is more efficient when $n = 2$?

20.10 Suppose we have a random sample $X_1, \ldots, X_n$ from an $Exp(\lambda)$ distribution. We want to estimate the expectation $1/\lambda$. According to Section 19.4,

$$\bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$$

is an unbiased estimator of $1/\lambda$. Let us consider more generally estimators $T$ of the form

$$T = c \cdot (X_1 + X_2 + \cdots + X_n),$$

where $c$ is a real number. We are interested in the MSE of these estimators and would like to know whether there are choices for $c$ that yield a smaller MSE than the choice $c = 1/n$.

a. Compute $\mathrm{MSE}(T)$ for each $c$.

b. For which $c$ does the estimator perform best in the MSE sense? Compare this to the unbiased estimator that one obtains for $c = 1/n$.


20.11 □ In Exercise 17.9 we modeled diameters of black cherry trees with the linear regression model (without intercept)

$$Y_i = \beta x_i + U_i$$

for $i = 1, 2, \ldots, n$. As usual, the $U_i$ here are independent random variables with $\mathrm{E}[U_i] = 0$ and $\mathrm{Var}(U_i) = \sigma^2$.

We considered three estimators for the slope $\beta$ of the line $y = \beta x$: the so-called least squares estimator $T_1$ (which will be considered in Chapter 22), the average slope estimator $T_2$, and the slope of the averages estimator $T_3$. These estimators are defined by:

$$T_1 = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2}, \quad T_2 = \frac{1}{n}\sum_{i=1}^{n} \frac{Y_i}{x_i}, \quad T_3 = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} x_i}.$$

In Exercise 19.8 it was shown that all three estimators are unbiased. Compute the MSE of all three estimators.

Remark: it can be shown that $T_1$ is always more efficient than $T_3$, which in turn is more efficient than $T_2$. To prove the first inequality one uses a famous inequality called the Cauchy-Schwarz inequality; for the second inequality one uses Jensen's inequality (can you see how?).


20.12 Let $X_1, X_2, \ldots, X_n$ represent $n$ draws without replacement from the numbers $1, 2, \ldots, N$ with equal probability. The goal of this exercise is to compute the distribution of $M_n$ in a way other than by the combinatorial analysis we did in this chapter.

a. Compute $P(M_n \le k)$, by using, as in Section 8.4, that:

$$P(M_n \le k) = P(X_1 \le k, X_2 \le k, \ldots, X_n \le k).$$

b. Derive that

$$P(M_n = n) = \frac{n!\,(N - n)!}{N!}.$$

c. Show that for $k = n + 1, \ldots, N$

$$P(M_n = k) = n \cdot \frac{(k - 1)!}{(k - n)!} \cdot \frac{(N - n)!}{N!}.$$

Maximum likelihood


In previous chapters we could easily construct estimators for various parameters of interest because these parameters had a natural sample analogue: expectation versus sample mean, probabilities versus relative frequencies, etc. However, in some situations such an analogue does not exist. In this chapter, a general principle to construct estimators is introduced, the so-called maximum likelihood principle. Maximum likelihood estimators have certain attractive properties that are discussed in the last section.


21.1  Why  a  general  principle? 

In Section 4.4 we modeled the number of cycles up to pregnancy by a random variable $X$ with a geometric distribution with (unknown) parameter $p$. Weinberg and Gladen studied the effect of smoking on the number of cycles and obtained the data in Table 21.1 for 100 smokers and 486 nonsmokers.


Table 21.1. Observed numbers of cycles up to pregnancy.

Number of cycles    1    2    3    4    5    6    7    8    9   10   11   12  >12
Smokers            29   16   17    4    3    9    4    5    1    1    1    3    7
Nonsmokers        198  107   55   38   18   22    7    9    5    3    6    6   12

Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution applied to comparative fecundability studies. Biometrics, 42(3):547-560, 1986.


Is  the  parameter  p,  which  equals  the  probability  of  becoming  pregnant  after 
one  cycle,  different  for  smokers  and  nonsmokers?  Let  us  try  to  find  out  by 
estimating  p  in  the  two  cases. 

What would be reasonable ways to estimate $p$? Since $p = P(X = 1)$, the law of large numbers (see Section 13.3) motivates use of

$$S = \frac{\text{number of } X_i \text{ equal to } 1}{n}$$

as an estimator for $p$. This yields estimates $p = 29/100 = 0.29$ for smokers and $p = 198/486 = 0.41$ for nonsmokers. We know from Section 19.4 that $S$ is an unbiased estimator for $p$. However, one cannot escape the feeling that $S$ is a "bad" estimator: $S$ does not use all the information in the table, i.e., the way the women are distributed over the numbers $2, 3, \ldots$ of observed numbers of cycles is not used. One would like to have an estimator that incorporates all the available information. Due to the way the data are given, this seems to be difficult. For instance, estimators based on the average cannot be evaluated, because 7 smokers and 12 nonsmokers had an unknown number of cycles up to pregnancy (larger than 12). If one simply ignores the last column in Table 21.1 as we did in Exercise 17.5, the average can be computed and yields $1/\bar{x}_{93} = 0.2809$ as an estimate of $p$ for smokers and $1/\bar{x}_{474} = 0.3688$ for nonsmokers. However, because we discard seven values larger than 12 in case of the smokers and twelve values larger than 12 in case of the nonsmokers, we overestimate $p$ in both cases.
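
The two ad hoc estimates discussed above can be recomputed directly from Table 21.1. The following is a minimal sketch; the function and variable names are our own and serve only as an illustration. It computes the relative frequency of women pregnant after one cycle, and the reciprocal of the average number of cycles when the ">12" column is simply dropped.

```python
# Counts from Table 21.1 for 1, 2, ..., 12 cycles; the ">12" column is kept apart.
smokers = [29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3]
smokers_over_12 = 7
nonsmokers = [198, 107, 55, 38, 18, 22, 7, 9, 5, 3, 6, 6]
nonsmokers_over_12 = 12


def relative_frequency_estimate(counts, over_12):
    """Fraction of all women who became pregnant after exactly one cycle."""
    n = sum(counts) + over_12
    return counts[0] / n


def truncated_mean_estimate(counts):
    """1 / (average number of cycles), ignoring the women in the '>12' column."""
    n_observed = sum(counts)
    total_cycles = sum(k * c for k, c in enumerate(counts, start=1))
    return n_observed / total_cycles


for name, counts, over in [("smokers", smokers, smokers_over_12),
                           ("nonsmokers", nonsmokers, nonsmokers_over_12)]:
    print(name,
          round(relative_frequency_estimate(counts, over), 4),
          round(truncated_mean_estimate(counts), 4))
```

The second estimate uses only 93 smokers and 474 nonsmokers, which is exactly why it overestimates $p$: the longest waiting times are left out of the average.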

In  the  next  section  we  introduce  a  general  principle  to  find  an  estimate  for  a 
parameter  of  interest,  the  maximum  likelihood  principle.  This  principle  yields 
good  estimators  and  will  solve  problems  such  as  those  stated  earlier. 


21.2  The  maximum  likelihood  principle 

Suppose  a  dealer  of  computer  chips  is  offered  on  the  black  market  two  batches 
of  10  000  chips  each.  According  to  the  seller,  in  one  batch  about  50%  of  the 
chips  are  defective,  while  this  percentage  is  about  10%  in  the  other  batch.  Our 
dealer  is  only  interested  in  this  last  batch.  Unfortunately  the  seller  cannot  tell 
the  two  batches  apart.  To  help  him  to  make  up  his  mind,  the  seller  offers  our 
dealer  one  batch,  from  which  he  is  allowed  to  select  and  test  10  chips.  After 
selecting  10  chips  arbitrarily,  it  turns  out  that  only  the  second  one  is  defective. 
Our  dealer  at  once  decides  to  buy  this  batch.  Is  this  a  wise  decision? 

With  the  batch  where  50%  of  the  chips  are  defective  it  is  more  likely  that 
defective  chips  will  appear,  whereas  with  the  other  batch  one  would  expect 
hardly  any  defective  chip.  Clearly,  our  dealer  chooses  the  batch  for  which  it  is 
most  likely  that  only  one  chip  is  defective.  This  is  also  the  guiding  idea  behind 
the  maximum  likelihood  principle. 


The  maximum  likelihood  principle.  Given  a  dataset,  choose 
the  parameter(s)  of  interest  in  such  a  way  that  the  data  are  most 
likely. 

Set $R_i = 1$ in case the $i$th tested chip was defective and $R_i = 0$ in case it was operational, where $i = 1, \ldots, 10$. Then $R_1, \ldots, R_{10}$ are ten independent $Ber(p)$ distributed random variables, where $p$ is the probability that a randomly selected chip is defective. The probability that the observed data occur is equal to

$$P(R_1 = 0, R_2 = 1, R_3 = 0, \ldots, R_{10} = 0) = p(1 - p)^9.$$

For the batch where about 10% of the chips are defective we find that

$$P(R_1 = 0, R_2 = 1, R_3 = 0, \ldots, R_{10} = 0) = \frac{1}{10}\left(\frac{9}{10}\right)^9 = 0.039,$$

whereas for the other batch

$$P(R_1 = 0, R_2 = 1, R_3 = 0, \ldots, R_{10} = 0) = \frac{1}{2}\left(\frac{1}{2}\right)^9 = 0.00098.$$

So the probability for the batch with only 10% defective chips is about 40 times larger than the probability for the other batch. Given the data, our dealer made a sound decision.
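
The comparison of the two batches can be sketched in a few lines of code. The helper name `likelihood_of_tests` is hypothetical and chosen only for this illustration; it multiplies the Bernoulli probabilities of the observed test results for a given defect probability $p$.

```python
def likelihood_of_tests(p, outcomes):
    """Probability of an observed 0/1 pattern of test results when each
    chip is defective (1) with probability p, independently of the others."""
    prob = 1.0
    for r in outcomes:
        prob *= p if r == 1 else (1 - p)
    return prob


# Only the second of the ten tested chips is defective.
observed = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

good_batch = likelihood_of_tests(0.10, observed)  # batch with about 10% defective chips
bad_batch = likelihood_of_tests(0.50, observed)   # batch with about 50% defective chips

print(good_batch, bad_batch, good_batch / bad_batch)
```

Changing `observed` to a pattern with three defective chips lets one explore the situation of Quick exercise 21.1 in the same way.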


Quick  exercise  21.1  Which  batch  should  the  dealer  choose  if  only  the  first 
three  chips  are  defective? 

Returning to the example of the number of cycles up to pregnancy, denoting $X_i$ as the number of cycles up to pregnancy of the $i$th smoker, recall that

$$P(X_i = k) = (1 - p)^{k-1}p$$

and

$$P(X_i > 12) = P(\text{no success in cycle 1 to 12}) = (1 - p)^{12};$$

cf. Quick exercise 4.6. From Table 21.1 we see that there are 29 smokers for which $X_i = 1$, that there are 16 for which $X_i = 2$, etc. Since we model the data as a random sample from a geometric distribution, the probability of the data, as a function of $p$, is given by

$$L(p) = C \cdot P(X_i = 1)^{29} \cdot P(X_i = 2)^{16} \cdots P(X_i = 12)^{3} \cdot P(X_i > 12)^{7}$$
$$= C \cdot p^{29} \cdot ((1 - p)p)^{16} \cdots ((1 - p)^{11}p)^{3} \cdot ((1 - p)^{12})^{7}$$
$$= C \cdot p^{93} \cdot (1 - p)^{322}.$$

Here $C$ is the number of ways we can assign 29 ones, 16 twos, ..., 3 twelves, and 7 numbers larger than 12 to 100 smokers.¹ According to the maximum likelihood principle we now choose $p$, with $0 \le p \le 1$, in such a way that $L(p)$ is maximal. Since $C$ does not depend on $p$, we do not need to know the value of $C$ explicitly to find for which $p$ the function $L(p)$ is maximal.

¹ C = 311657028822819441451842682167854800096263625208359116504431153487280760832000000000.

Differentiating $L(p)$ with respect to $p$ yields that

$$L'(p) = C\left[93p^{92}(1 - p)^{322} - 322p^{93}(1 - p)^{321}\right]$$
$$= Cp^{92}(1 - p)^{321}\left[93(1 - p) - 322p\right]$$
$$= Cp^{92}(1 - p)^{321}(93 - 415p).$$

Now $L'(p) = 0$ if $p = 0$, $p = 1$, or $p = 93/415 = 0.224$, and $L(p)$ attains its unique maximum in this last point (check this!). We say that $93/415 = 0.224$ is the maximum likelihood estimate of $p$ for the smokers. Note that this estimate is quite a lot smaller than the estimate 0.29 for the smokers we found in the previous section, and the estimate 0.2809 you obtained in Exercise 17.5.
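
As a numerical cross-check of this calculation, one can evaluate $p^{93}(1 - p)^{322}$ on a fine grid and compare the maximizer with $93/415$. This is only a sketch: the constant $C$ is dropped because it does not influence the location of the maximum, and the grid width of 0.0001 is an arbitrary choice of ours.

```python
def smokers_likelihood(p):
    """Likelihood of the smokers' data up to the constant factor C."""
    return p ** 93 * (1 - p) ** 322


# Grid of candidate values 0.0001, 0.0002, ..., 0.9999 for p.
grid = [i / 10000 for i in range(1, 10000)]
p_grid = max(grid, key=smokers_likelihood)
p_exact = 93 / 415

print(p_grid, round(p_exact, 4))
```

The grid maximizer agrees with $93/415$ up to the grid resolution, which supports the claim that the stationary point found by differentiation is indeed the maximum.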

Quick exercise 21.2 Check that for the nonsmokers the probability of the data is given by

$$L(p) = \text{constant} \cdot p^{474}(1 - p)^{955}.$$

Compute the maximum likelihood estimate for $p$.


Remark 21.1 (Some history). The method of maximum likelihood estimation was propounded by Ronald Aylmer Fisher in a highly influential paper. In fact, this paper does not contain the original statement of the method, which was published by Fisher in 1912 [9], nor does it contain the original definition of likelihood, which appeared in 1921 (see [10]). The roots of the maximum likelihood method date back as far as 1713, when Jacob Bernoulli's Ars Conjectandi ([1]) was posthumously published. In the eighteenth century other important contributions were by Daniel Bernoulli, Lambert, and Lagrange (see also [2], [16], and [17]). It is interesting to remark that another giant of statistics, Karl Pearson, had not understood Fisher's method. Fisher was hurt by Pearson's lack of understanding, which eventually led to a violent confrontation.


21.3  Likelihood  and  loglikelihood 

Suppose we have a dataset $x_1, x_2, \ldots, x_n$, modeled as a realization of a random sample from a distribution characterized by a parameter $\theta$. To stress the dependence of the distribution on $\theta$, we write

$$p_\theta(x)$$

for the probability mass function in case we have a sample from a discrete distribution and

$$f_\theta(x)$$

for the probability density function when we have a sample from a continuous distribution.

For a dataset $x_1, x_2, \ldots, x_n$ modeled as the realization of a random sample $X_1, \ldots, X_n$ from a discrete distribution, the maximum likelihood principle now tells us to estimate $\theta$ by that value for which the function $L(\theta)$, given by

$$L(\theta) = P(X_1 = x_1, \ldots, X_n = x_n) = p_\theta(x_1) \cdots p_\theta(x_n),$$

is maximal. This value is called the maximum likelihood estimate of $\theta$. The function $L(\theta)$ is called the likelihood function. This is a function of $\theta$, determined by the numbers $x_1, x_2, \ldots, x_n$.

In case the sample is from a continuous distribution we clearly need to define the likelihood function $L(\theta)$ in a way different from the discrete case (if we would define $L(\theta)$ as in the discrete case, one always would have that $L(\theta) = 0$). For a reasonable definition of the likelihood function we have the following motivation. Let $f_\theta$ be the probability density function of $X$, and let $\varepsilon > 0$ be some fixed, small number. It is sensible to choose $\theta$ in such a way that the probability $P(x_1 - \varepsilon \le X_1 \le x_1 + \varepsilon, \ldots, x_n - \varepsilon \le X_n \le x_n + \varepsilon)$ is maximal. Since the $X_i$ are independent, we find that

$$P(x_1 - \varepsilon \le X_1 \le x_1 + \varepsilon, \ldots, x_n - \varepsilon \le X_n \le x_n + \varepsilon)$$
$$= P(x_1 - \varepsilon \le X_1 \le x_1 + \varepsilon) \cdots P(x_n - \varepsilon \le X_n \le x_n + \varepsilon) \qquad (21.1)$$
$$\approx f_\theta(x_1)f_\theta(x_2) \cdots f_\theta(x_n)(2\varepsilon)^n,$$

where in the last step we used that (see also Equation (5.1))

$$P(x_i - \varepsilon \le X_i \le x_i + \varepsilon) = \int_{x_i - \varepsilon}^{x_i + \varepsilon} f_\theta(x)\,dx \approx 2\varepsilon f_\theta(x_i).$$

Note that the right-hand side of (21.1) is maximal whenever the function $f_\theta(x_1)f_\theta(x_2) \cdots f_\theta(x_n)$ is maximal, irrespective of the value of $\varepsilon$. In view of this, given a dataset $x_1, x_2, \ldots, x_n$, the likelihood function $L(\theta)$ is defined by

$$L(\theta) = f_\theta(x_1)f_\theta(x_2) \cdots f_\theta(x_n)$$

in the continuous case.


Maximum likelihood estimates. The maximum likelihood estimate of $\theta$ is the value $t = h(x_1, x_2, \ldots, x_n)$ that maximizes the likelihood function $L(\theta)$. The corresponding random variable

$$T = h(X_1, X_2, \ldots, X_n)$$

is called the maximum likelihood estimator for $\theta$.
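
As an example, suppose we have a dataset $x_1, x_2, \ldots, x_n$ modeled as a realization of a random sample from an $Exp(\lambda)$ distribution, with probability density function given by $f_\lambda(x) = 0$ if $x < 0$ and

$$f_\lambda(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0.$$

Then the likelihood is given by

$$L(\lambda) = f_\lambda(x_1)f_\lambda(x_2) \cdots f_\lambda(x_n)$$
$$= \lambda e^{-\lambda x_1} \cdot \lambda e^{-\lambda x_2} \cdots \lambda e^{-\lambda x_n}$$
$$= \lambda^n \cdot e^{-\lambda(x_1 + x_2 + \cdots + x_n)}.$$

To obtain the maximum likelihood estimate of $\lambda$ it is enough to find the maximum of $L(\lambda)$. To do so, we determine the derivative of $L(\lambda)$:

$$\frac{d}{d\lambda}L(\lambda) = n\lambda^{n-1}e^{-\lambda\sum_{i=1}^{n} x_i} - \lambda^n\left(\sum_{i=1}^{n} x_i\right)e^{-\lambda\sum_{i=1}^{n} x_i}$$
$$= n\lambda^{n-1}e^{-\lambda\sum_{i=1}^{n} x_i}\left(1 - \lambda\bar{x}_n\right).$$

We see that $d(L(\lambda))/d\lambda = 0$ if and only if

$$1 - \lambda\bar{x}_n = 0,$$

i.e., if $\lambda = 1/\bar{x}_n$. Check that for this value of $\lambda$ the likelihood function $L(\lambda)$ attains a maximum! So the maximum likelihood estimator for $\lambda$ is $1/\bar{X}_n$.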
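
To make this concrete, the following sketch evaluates the exponential likelihood $L(\lambda) = \lambda^n e^{-\lambda(x_1 + \cdots + x_n)}$ for a small made-up dataset and checks that it is largest at $\lambda = 1/\bar{x}_n$. The five data values are purely hypothetical and not taken from any dataset in this book.

```python
import math


def exp_likelihood(lam, data):
    """Likelihood lambda^n * exp(-lambda * sum(data)) of an Exp(lambda) sample."""
    return lam ** len(data) * math.exp(-lam * sum(data))


# Hypothetical observations, used for illustration only.
data = [0.8, 2.1, 0.3, 1.4, 0.9]

# Maximum likelihood estimate: the reciprocal of the sample mean.
lam_hat = len(data) / sum(data)

# The likelihood at lam_hat should not be exceeded at other values of lambda.
for factor in (0.5, 0.9, 1.0, 1.1, 2.0):
    lam = factor * lam_hat
    print(round(lam, 4), exp_likelihood(lam, data))
```

Among the printed values, the one for `factor = 1.0` is the largest, in line with the calculation above.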

In the example of the number of cycles up to pregnancy of smoking women, we have seen that $L(p) = C \cdot p^{93} \cdot (1 - p)^{322}$. The maximum likelihood estimate of $p$ was found by differentiating $L(p)$. Differentiating is not always possible, as the following example shows.


Estimating  the  upper  endpoint  of  a  uniform  distribution 

Suppose the dataset $x_1 = 0.98$, $x_2 = 1.57$, and $x_3 = 0.31$ is the realization of a random sample from a $U(0, \theta)$ distribution with $\theta > 0$ unknown. The probability density function of each $X_i$ is now given by $f_\theta(x) = 0$ if $x$ is not in $[0, \theta]$ and

$$f_\theta(x) = \frac{1}{\theta} \quad \text{for } 0 \le x \le \theta.$$

The likelihood $L(\theta)$ is zero if $\theta$ is smaller than at least one of the $x_i$, and equals $1/\theta^3$ if $\theta$ is greater than or equal to each of the three $x_i$, i.e.,

$$L(\theta) = f_\theta(x_1)f_\theta(x_2)f_\theta(x_3) = \begin{cases} \dfrac{1}{\theta^3} & \text{if } \theta \ge \max(x_1, x_2, x_3) = 1.57 \\ 0 & \text{if } \theta < \max(x_1, x_2, x_3) = 1.57. \end{cases}$$


Fig. 21.1. Likelihood function $L(\theta)$ of a sample from a $U(0, \theta)$ distribution.


Figure 21.1 depicts this likelihood function. One glance at this figure is enough to realize that $L(\theta)$ attains its maximum at $\max(x_1, x_2, x_3) = 1.57$.

In general, given a dataset $x_1, x_2, \ldots, x_n$ originating from a $U(0, \theta)$ distribution, we see that $L(\theta) = 0$ if $\theta$ is smaller than at least one of the $x_i$ and that $L(\theta) = 1/\theta^n$ if $\theta$ is greater than or equal to the largest of the $x_i$. We conclude that the maximum likelihood estimator of $\theta$ is given by $\max\{X_1, X_2, \ldots, X_n\}$.
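
The shape of this likelihood is easy to reproduce numerically. The sketch below, with a function name of our own choosing, evaluates the likelihood of the three observations 0.98, 1.57, and 0.31 at a few values of the upper endpoint: it is zero to the left of the largest observation, jumps to its maximum there, and decreases afterwards.

```python
def uniform_likelihood(theta, data):
    """Likelihood of a U(0, theta) sample: 0 if theta is below the largest
    observation, and 1 / theta^n otherwise."""
    if theta < max(data):
        return 0.0
    return theta ** (-len(data))


data = [0.98, 1.57, 0.31]
theta_hat = max(data)  # maximum likelihood estimate of the upper endpoint

for theta in (1.0, 1.56, 1.57, 1.58, 2.0, 3.0):
    print(theta, uniform_likelihood(theta, data))
```

Because the likelihood has a jump at the maximizer, the maximum cannot be located by setting a derivative equal to zero, which is the point of this example.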

Loglikelihood 

In the preceding example it was easy to find the value of the parameter for which the likelihood is maximal. Usually one can find the maximum by differentiating the likelihood function $L(\theta)$. The calculation of the derivative of $L(\theta)$ may be tedious, because $L(\theta)$ is a product of terms, all involving $\theta$ (see also Quick exercise 21.3). To differentiate $L(\theta)$ we have to apply the product rule from calculus. Considering the logarithm of $L(\theta)$ changes the product of the terms involving $\theta$ into a sum of logarithms of these terms, which makes the process of differentiating easier. Moreover, because the logarithm is an increasing function, the likelihood function $L(\theta)$ and the loglikelihood function $\ell(\theta)$, defined by

$$\ell(\theta) = \ln(L(\theta)),$$

attain their extreme values for the same values of $\theta$. In particular, $L(\theta)$ is maximal if and only if $\ell(\theta)$ is maximal. This is illustrated in Figure 21.2 by the likelihood function $L(p) = Cp^{93}(1 - p)^{322}$ and the loglikelihood function $\ell(p) = \ln(C) + 93\ln(p) + 322\ln(1 - p)$ for the smokers.

In the situation that we have a dataset $x_1, x_2, \ldots, x_n$ modeled as a realization of a random sample from an $Exp(\lambda)$ distribution, we found as likelihood function $L(\lambda) = \lambda^n \cdot e^{-\lambda(x_1 + x_2 + \cdots + x_n)}$. Therefore, the loglikelihood function is given by

$$\ell(\lambda) = n\ln(\lambda) - \lambda(x_1 + x_2 + \cdots + x_n).$$

Fig. 21.2. The graphs of the likelihood function $L(p)$ and the loglikelihood function $\ell(p)$ for the smokers.

Quick exercise 21.3 In this example, use the loglikelihood function $\ell(\lambda)$ to show that the maximum likelihood estimate of $\lambda$ equals $1/\bar{x}_n$.

Estimating  the  parameters  of  the  normal  distribution 

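Suppose that the dataset $x_1, x_2, \ldots, x_n$ is a realization of a random sample from an $N(\mu, \sigma^2)$ distribution, with $\mu$ and $\sigma$ unknown. What are the maximum likelihood estimates for $\mu$ and $\sigma$?

In this case $\theta$ is the vector $(\mu, \sigma)$, and therefore the likelihood function is a function of two variables:

$$L(\mu, \sigma) = f_{\mu,\sigma}(x_1)f_{\mu,\sigma}(x_2) \cdots f_{\mu,\sigma}(x_n),$$

where each $f_{\mu,\sigma}(x)$ is the $N(\mu, \sigma^2)$ probability density function:

$$f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < \infty.$$

Since

$$\ln(f_{\mu,\sigma}(x)) = -\ln(\sigma) - \ln(\sqrt{2\pi}) - \frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2,$$

one finds that

$$\ell(\mu, \sigma) = -n\ln(\sigma) - n\ln(\sqrt{2\pi}) - \frac{1}{2\sigma^2}\left((x_1 - \mu)^2 + \cdots + (x_n - \mu)^2\right).$$

The partial derivatives of $\ell$ are

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\left((x_1 - \mu) + \cdots + (x_n - \mu)\right) = \frac{n}{\sigma^2}(\bar{x}_n - \mu),$$
$$\frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\left((x_1 - \mu)^2 + \cdots + (x_n - \mu)^2\right).$$

Solving $\partial \ell/\partial \mu = 0$ and $\partial \ell/\partial \sigma = 0$ yields

$$\mu = \bar{x}_n \quad \text{and} \quad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}.$$

It is not hard to show that for these values of $\mu$ and $\sigma$ the likelihood function $L(\mu, \sigma)$ attains a maximum. We find that $\bar{x}_n$ is the maximum likelihood estimate for $\mu$ and that

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}$$

is the maximum likelihood estimate for $\sigma$.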
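
A short numerical illustration may help here. The sketch below computes the sample mean and the square root of the average squared deviation (with divisor $n$, not $n - 1$) for a small hypothetical dataset, and compares the normal loglikelihood at these values with the loglikelihood at a few nearby parameter values. The data and the function names are assumptions made only for this example.

```python
import math


def normal_mle(data):
    """Return the maximum likelihood estimates (mu, sigma) for a normal sample."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)
    return mu_hat, sigma_hat


def normal_loglik(mu, sigma, data):
    """Loglikelihood of an N(mu, sigma^2) sample."""
    n = len(data)
    squares = sum((x - mu) ** 2 for x in data)
    return (-n * math.log(sigma)
            - n * math.log(math.sqrt(2 * math.pi))
            - squares / (2 * sigma ** 2))


# Hypothetical observations, used for illustration only.
data = [4.1, 5.3, 3.8, 6.0, 4.8]
mu_hat, sigma_hat = normal_mle(data)
print(mu_hat, sigma_hat, normal_loglik(mu_hat, sigma_hat, data))

# Nearby parameter values give a smaller loglikelihood.
for d_mu, d_sigma in [(0.2, 0.0), (-0.2, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    print(normal_loglik(mu_hat + d_mu, sigma_hat + d_sigma, data))
```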

21.4  Properties  of  maximum  likelihood  estimators 

Apart  from  the  fact  that  the  maximum  likelihood  principle  provides  a  general 
principle  to  construct  estimators,  one  can  also  show  that  maximum  likelihood 
estimators  have  several  desirable  properties. 

Invariance  principle 

In the previous example, we saw that

$$D_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}$$

is the maximum likelihood estimator for the parameter $\sigma$ of an $N(\mu, \sigma^2)$ distribution. Does this imply that $D_n^2$ is the maximum likelihood estimator for $\sigma^2$? This is indeed the case! In general one can show that if $T$ is the maximum likelihood estimator of a parameter $\theta$ and $g(\theta)$ is an invertible function of $\theta$, then $g(T)$ is the maximum likelihood estimator for $g(\theta)$.

Asymptotic unbiasedness

The maximum likelihood estimator $T$ may be biased. For example, because $D_n^2 = \frac{n-1}{n}S_n^2$ for the previously mentioned maximum likelihood estimator $D_n^2$ of the parameter $\sigma^2$ of an $N(\mu, \sigma^2)$ distribution, it follows from Section 19.4 that

$$\mathrm{E}[D_n^2] = \mathrm{E}\left[\frac{n-1}{n}S_n^2\right] = \frac{n-1}{n}\mathrm{E}[S_n^2] = \frac{n-1}{n}\sigma^2.$$

We see that $D_n^2$ is a biased estimator for $\sigma^2$, but also that as $n$ goes to infinity, the expected value of $D_n^2$ converges to $\sigma^2$. This holds more generally. Under mild conditions on the distribution of the random variables $X_i$ under consideration (see, e.g., [36]), one can show that asymptotically (that is, as the size $n$ of the dataset goes to infinity) maximum likelihood estimators are unbiased. By this we mean that if $T_n = h(X_1, X_2, \ldots, X_n)$ is the maximum likelihood estimator for a parameter $\theta$, then

$$\lim_{n \to \infty} \mathrm{E}[T_n] = \theta.$$
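
The bias of the maximum likelihood estimator of the variance of a normal distribution, and its disappearance for growing $n$, can be made visible with a small simulation. The sketch below is an illustration under assumptions of our own (seed, number of repetitions, and $\sigma = 2$ are arbitrary choices): it averages $\frac{1}{n}\sum (X_i - \bar{X}_n)^2$ over many simulated normal samples and compares the result with $\frac{n-1}{n}\sigma^2$.

```python
import random


def mle_variance(sample):
    """Average squared deviation from the sample mean (divisor n, not n - 1)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


def average_mle_variance(n, sigma, repetitions, rng):
    """Monte Carlo approximation of the expectation of mle_variance for
    samples of size n from an N(0, sigma^2) distribution."""
    total = 0.0
    for _ in range(repetitions):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += mle_variance(sample)
    return total / repetitions


rng = random.Random(2024)  # fixed seed so that the run is reproducible
sigma = 2.0
for n in (2, 5, 20, 100):
    approx = average_mle_variance(n, sigma, 5000, rng)
    exact = (n - 1) / n * sigma ** 2
    print(n, round(approx, 2), round(exact, 2))
```

For small $n$ the simulated averages stay clearly below $\sigma^2 = 4$, while for larger $n$ they approach it, as the factor $(n - 1)/n$ predicts.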


Asymptotic  minimum  variance 

The variance of an unbiased estimator for a parameter $\theta$ is always larger than or equal to a certain positive number, known as the Cramér-Rao lower bound (see Remark 20.2). Again under mild conditions one can show that maximum likelihood estimators have asymptotically the smallest variance among unbiased estimators. That is, asymptotically the variance of the maximum likelihood estimator for a parameter $\theta$ attains the Cramér-Rao lower bound.


21.5  Solutions  to  the  quick  exercises 

21.1 In the case that only the first three chips are defective, the probability that the observed data occur is equal to

$$P(R_1 = 1, R_2 = 1, R_3 = 1, R_4 = 0, \ldots, R_{10} = 0) = p^3(1 - p)^7.$$

For the batch where about 10% of the chips are defective we find that

$$P(R_1 = 1, R_2 = 1, R_3 = 1, R_4 = 0, \ldots, R_{10} = 0) = \left(\frac{1}{10}\right)^3\left(\frac{9}{10}\right)^7 = 0.00048,$$

whereas for the other batch this probability is equal to $\left(\frac{1}{2}\right)^3\left(\frac{1}{2}\right)^7 = 0.00098$.

So the probability for the batch with about 50% defective chips is about 2 times larger than the probability for the other batch. In view of this, it would be reasonable to choose the other batch, not the tested one.

21.2 From Table 21.1 we derive

$$L(p) = \text{constant} \cdot P(X_i = 1)^{198} \cdot P(X_i = 2)^{107} \cdots P(X_i = 12)^{6} \cdot P(X_i > 12)^{12}$$
$$= \text{constant} \cdot p^{198} \cdot [(1 - p)p]^{107} \cdots [(1 - p)^{11}p]^{6} \cdot [(1 - p)^{12}]^{12}$$
$$= \text{constant} \cdot p^{474} \cdot (1 - p)^{955}.$$

Here the constant is the number of ways we can assign 198 ones, 107 twos, ..., 6 twelves, and 12 numbers larger than 12 to 486 nonsmokers. Differentiating $L(p)$ with respect to $p$ yields that

$$L'(p) = \text{constant} \cdot \left[474p^{473}(1 - p)^{955} - 955p^{474}(1 - p)^{954}\right]$$
$$= \text{constant} \cdot p^{473}(1 - p)^{954}\left[474(1 - p) - 955p\right]$$
$$= \text{constant} \cdot p^{473}(1 - p)^{954}(474 - 1429p).$$

Now $L'(p) = 0$ if $p = 0$, $p = 1$, or $p = 474/1429 = 0.33$, and $L(p)$ attains its unique maximum in this last point.

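21.3 The loglikelihood function $\ell(\lambda)$ has derivative

$$\ell'(\lambda) = \frac{n}{\lambda} - (x_1 + x_2 + \cdots + x_n) = n\left(\frac{1}{\lambda} - \bar{x}_n\right).$$

One finds that $\ell'(\lambda) = 0$ if and only if $\lambda = 1/\bar{x}_n$ and that this is a maximum. The maximum likelihood estimate for $\lambda$ is therefore $1/\bar{x}_n$.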


21.6  Exercises 

21.1 ⊞ Consider the following situation. Suppose we have two fair dice, $D_1$ with 5 red sides and 1 white side and $D_2$ with 1 red side and 5 white sides. We pick one of the dice randomly, and throw it repeatedly until red comes up for the first time. With the same die this experiment is repeated two more times. Suppose the following happens:

First experiment: first red appears in 3rd throw
Second experiment: first red appears in 5th throw
Third experiment: first red appears in 4th throw.

Show that for die $D_1$ this happens with probability $5.7424 \cdot 10^{-8}$, and for die $D_2$ the probability with which this happens is $8.9725 \cdot 10^{-4}$. Given these probabilities, which die do you think we picked?

21.2 □ We throw an unfair coin repeatedly until heads comes up for the first time. We repeat this experiment three times (with the same coin) and obtain the following data:

First experiment: heads first comes up in 3rd throw
Second experiment: heads first comes up in 5th throw
Third experiment: heads first comes up in 4th throw.

Let $p$ be the probability that heads comes up in a throw with this coin. Determine the maximum likelihood estimate $\hat{p}$ of $p$.


21.3 In Exercise 17.4 we modeled the hits of London by flying bombs by a Poisson distribution with parameter $\mu$.

a. Use the data from Exercise 17.4 to find the maximum likelihood estimate of $\mu$.

b. Suppose the summarized data from Exercise 17.4 got corrupted in the following way:

Number of hits       0 or 1    2    3    4    5    6    7
Number of squares       440   93   35    7    0    0    1

Using this new data, what is the maximum likelihood estimate of $\mu$?


21.4 ⊞ In Section 19.1, we considered the arrivals of packages at a network server, where we modeled the number of arrivals per minute by a $Pois(\mu)$ distribution. Let $x_1, x_2, \ldots, x_n$ be a realization of a random sample from a $Pois(\mu)$ distribution. We saw on page 286 that a natural estimate of the probability of zeros in the dataset is given by

$$\frac{\text{number of } x_i \text{ equal to zero}}{n}.$$

a. Show that the likelihood $L(\mu)$ is given by

$$L(\mu) = \frac{e^{-n\mu}}{x_1! \cdots x_n!}\,\mu^{x_1 + x_2 + \cdots + x_n}.$$

b. Determine the loglikelihood $\ell(\mu)$ and the formula of the maximum likelihood estimate for $\mu$.

c. What is the maximum likelihood estimate for the probability $e^{-\mu}$ of zero arrivals?


21.5 □ Suppose that $x_1, x_2, \ldots, x_n$ is a dataset, which is a realization of a random sample from a normal distribution.

a. Let the probability density of this normal distribution be given by

$$f_\mu(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x - \mu)^2} \quad \text{for } -\infty < x < \infty.$$

Determine the maximum likelihood estimate for $\mu$.

b. Now suppose that the density of this normal distribution is given by

$$f_\sigma(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}x^2/\sigma^2} \quad \text{for } -\infty < x < \infty.$$

Determine the maximum likelihood estimate for $\sigma$.

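21.6 Let $x_1, x_2, \ldots, x_n$ be a dataset that is a realization of a random sample
from a distribution with probability density $f_\delta(x)$ given by

\[ f_\delta(x) = \begin{cases} e^{-(x-\delta)} & \text{for } x \ge \delta \\ 0 & \text{for } x < \delta. \end{cases} \]

a. Draw the likelihood $L(\delta)$.

b. Determine the maximum likelihood estimate for $\delta$.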

21.7 □ Suppose that $x_1, x_2, \ldots, x_n$ is a dataset, which is a realization of a
random sample from a Rayleigh distribution, which is a continuous distribution
with probability density function given by

\[ f_\theta(x) = \frac{x}{\theta^2} \, e^{-\frac{1}{2}x^2/\theta^2} \quad \text{for } x \ge 0. \]

In this case what is the maximum likelihood estimate for $\theta$?

21.8 ⊞ (Exercises 19.7 and 20.7 continued) A certain type of plant can be
divided into four types: starchy-green, starchy-white, sugary-green, and
sugary-white. The following table lists the counts of the various types among 3839
leaves.

Type            Count
Starchy-green   1997
Sugary-white    32
Starchy-white   906
Sugary-green    904

Setting

\[ X = \begin{cases} 1 & \text{if the observed leaf is of type starchy-green} \\ 2 & \text{if the observed leaf is of type sugary-white} \\ 3 & \text{if the observed leaf is of type starchy-white} \\ 4 & \text{if the observed leaf is of type sugary-green,} \end{cases} \]

the probability mass function $p$ of $X$ is given by

a        1                          2                     3                          4
p(a)     $\frac{1}{4}(2+\theta)$    $\frac{1}{4}\theta$   $\frac{1}{4}(1-\theta)$    $\frac{1}{4}(1-\theta)$

and $p(a) = 0$ for all other $a$. Here $0 < \theta < 1$ is an unknown parameter,
which was estimated in Exercise 19.7. We want to find a maximum likelihood
estimate of $\theta$.

a. Use the data to find the likelihood $L(\theta)$ and the loglikelihood $\ell(\theta)$.

b. What is the maximum likelihood estimate of $\theta$ using the data from the
preceding table?

c. Suppose that we have the counts of $n$ different leaves: $n_1$ of type
starchy-green, $n_2$ of type sugary-white, $n_3$ of type starchy-white, and $n_4$ of type
sugary-green (so $n = n_1 + n_2 + n_3 + n_4$). Determine the general formula
for the maximum likelihood estimate of $\theta$.

21.9 □ Let $x_1, x_2, \ldots, x_n$ be a dataset that is a realization of a random sample
from a $U(\alpha, \beta)$ distribution (with $\alpha$ and $\beta$ unknown, $\alpha < \beta$). Determine the
maximum likelihood estimates for $\alpha$ and $\beta$.

21.10 Let $x_1, x_2, \ldots, x_n$ be a dataset, which is a realization of a random
sample from a $Par(\alpha)$ distribution. What is the maximum likelihood estimate
for $\alpha$?

21.11 ⊞ In Exercise 4.13 we considered the situation where we have a box
containing an unknown number, say $N$, of identical bolts. In order to get an
idea of the size of $N$ we introduced three random variables $X$, $Y$, and $Z$. Here
we will use $X$ and $Y$, and in the next exercise $Z$, to find maximum likelihood
estimates of $N$.

a. Suppose that $x_1, x_2, \ldots, x_n$ is a dataset, which is a realization of a random
sample from a $Geo(1/N)$ distribution. Determine the maximum likelihood
estimate for $N$.

b. Suppose that $y_1, y_2, \ldots, y_n$ is a dataset, which is a realization of a random
sample from a discrete uniform distribution on $1, 2, \ldots, N$. Determine the
maximum likelihood estimate for $N$.

21.12 (Exercise 21.11 continued.) Suppose that $m$ bolts in the box were
marked and then $r$ bolts were selected from the box; $Z$ is the number of
marked bolts in the sample. (Recall that it was shown in Exercise 4.13 c that
$Z$ has a hypergeometric distribution, with parameters $m$, $N$, and $r$.) Suppose
that $k$ bolts in the sample were marked. Show that the likelihood $L(N)$ is
given by

\[ L(N) = \frac{\binom{m}{k} \binom{N-m}{r-k}}{\binom{N}{r}}. \]

Next show that $L(N)$ increases for $N < mr/k$ and decreases for $N > mr/k$,
and conclude that $mr/k$ is the maximum likelihood estimate for $N$.

21.13 Often one can model the times that customers arrive at a shop rather
well by a Poisson process with (unknown) rate $\lambda$ (customers/hour). On a
certain day, one of the attendants noticed that between noon and 12.45 p.m.
two customers arrived, and another attendant noticed that on the same day
one customer arrived between 12.15 and 1 p.m. Use the observations of the
attendants to determine the maximum likelihood estimate of $\lambda$.

21.14 A very inexperienced archer shoots an arrow $n$ times at a disc of
(unknown) radius $\theta$. The disc is hit every time, but at completely random places.
Let $r_1, r_2, \ldots, r_n$ be the distances of the various hits to the center of the disc.
Determine the maximum likelihood estimate for $\theta$.


21.15 On January 28, 1986, the main fuel tank of the space shuttle Challenger
exploded shortly after takeoff. Essential in this accident was the leakage of
some of the six O-rings of the Challenger. In Section 1.4 the probability of
failure of an O-ring is given by

\[ p(t) = \frac{e^{a+b\cdot t}}{1 + e^{a+b\cdot t}}, \]

where $t$ is the temperature at launch in degrees Fahrenheit. In Table 21.2
the temperature $t$ (in °F, rounded to the nearest integer) and the number of
failures $N$ for 23 missions are given, ordered according to increasing
temperatures. (See also Figure 1.3, where these data are graphically depicted.) Give
the likelihood $L(a, b)$ and the loglikelihood $\ell(a, b)$.


Table 21.2. Space shuttle failure data of pre-Challenger missions.

t   53  57  58  63  66  67  67  67
N    2   1   1   1   0   0   0   0

t   68  69  70  70  70  70  72  73
N    0   0   0   0   1   1   0   0

t   75  75  76  76  78  79  81
N    0   2   0   0   0   0   0

21.16 In the 18th century Georges-Louis Leclerc, Comte de Buffon
(1707-1788) found an amusing way to approximate the number $\pi$ using probability
theory and statistics. Buffon had the following idea: take a needle and a large
sheet of paper, and draw horizontal lines that are a needle-length apart. Throw
the needle a number of times (say $n$ times) on the sheet, and count how often it
hits one of the horizontal lines. Say this number is $s_n$; then $s_n$ is the realization
of a $Bin(n, p)$ distributed random variable $S_n$. Here $p$ is the probability that
the needle hits one of the horizontal lines. In Exercise 9.20 you found that
$p = 2/\pi$. Show that

\[ T = \frac{2n}{S_n} \]

is the maximum likelihood estimator for $\pi$.

The maximum likelihood principle provides a way to estimate parameters. The
applicability  of  the  method  is  quite  general  but  not  universal.  For  example, 
in  the  simple  linear  regression  model,  introduced  in  Section  17.4,  we  need  to 
know  the  distribution  of  the  response  variable  in  order  to  find  the  maximum 
likelihood  estimates  for  the  parameters  involved.  In  this  chapter  we  will  see 
how  these  parameters  can  be  estimated  using  the  method  of  least  squares. 
Furthermore,  the  relation  between  least  squares  and  maximum  likelihood  will 
be  investigated  in  the  case  of  normally  distributed  errors. 


22.1  Least  squares  estimation  and  regression 

Recall from Section 17.4 the simple linear regression model for a bivariate
dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. In this model $x_1, x_2, \ldots, x_n$ are
nonrandom and $y_1, y_2, \ldots, y_n$ are realizations of random variables $Y_1, Y_2, \ldots, Y_n$
satisfying

\[ Y_i = \alpha + \beta x_i + U_i \quad \text{for } i = 1, 2, \ldots, n, \]

where $U_1, U_2, \ldots, U_n$ are independent random variables with zero expectation
and variance $\sigma^2$. How can one obtain estimates for the parameters $\alpha$, $\beta$, and $\sigma^2$
in this model?

Fig. 22.1. The observed value $y_i$ corresponding to $x_i$ and the value $\alpha + \beta x_i$ on the
regression line $y = \alpha + \beta x$.

Note that we cannot find maximum likelihood estimates for these parameters,
simply because we have no further knowledge about the distribution of the $U_i$
(and consequently of the $Y_i$). We want to choose $\alpha$ and $\beta$ in such a way that
we obtain a line that fits the data best. A classical approach to do this is to
consider the sum of squared distances between the observed values $y_i$ and the
values $\alpha + \beta x_i$ on the regression line $y = \alpha + \beta x$. See Figure 22.1, where these
distances are indicated. The method of least squares prescribes choosing $\alpha$
and $\beta$ such that the sum of squares

\[ S(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \]

is minimal. The $i$th term in the sum is the squared distance in the vertical
direction from $(x_i, y_i)$ to the line $y = \alpha + \beta x$. To find these so-called least
squares estimates, we differentiate $S(\alpha, \beta)$ with respect to $\alpha$ and $\beta$, and we
set the derivatives equal to 0:

\[ \frac{\partial}{\partial \alpha} S(\alpha, \beta) = 0 \iff \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0 \]

\[ \frac{\partial}{\partial \beta} S(\alpha, \beta) = 0 \iff \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)\, x_i = 0. \]

This is equivalent to

\[ n\alpha + \beta \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \]

\[ \alpha \sum_{i=1}^{n} x_i + \beta \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i. \]

For example, for the timber data from Table 15.5 we would obtain

\[ \begin{aligned} 36\,\alpha + 1646.4\,\beta &= 52\,901 \\ 1646.4\,\alpha + 81\,750.02\,\beta &= 2\,790\,525. \end{aligned} \]

These are two equations with two unknowns $\alpha$ and $\beta$. Solving for $\alpha$ and $\beta$
yields the solutions $\hat\alpha = -1160.5$ and $\hat\beta = 57.51$. In Figure 22.2 a scatterplot of
the timber dataset, together with the estimated regression line
$y = -1160.5 + 57.51x$, is depicted.
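The two normal equations can also be solved numerically. The following is a minimal sketch, not part of the original text: the helper name `solve_normal_equations` is hypothetical, and it simply applies the explicit solution of the 2-by-2 linear system to the five sums that appear in the equations.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_xx, sum_xy):
    """Solve  n*a + sum_x*b = sum_y  and  sum_x*a + sum_xx*b = sum_xy  for (a, b)."""
    det = n * sum_xx - sum_x ** 2  # zero only when all x_i are equal
    if det == 0:
        raise ValueError("all x_i are equal; the slope is not determined")
    beta = (n * sum_xy - sum_x * sum_y) / det
    alpha = (sum_y - beta * sum_x) / n
    return alpha, beta


# Sums for the timber data, copied from the two equations in the text.
alpha_hat, beta_hat = solve_normal_equations(36, 1646.4, 52901, 81750.02, 2790525)
print(round(alpha_hat, 1), round(beta_hat, 2))
```

After rounding, the two values should agree with the intercept and slope quoted in the text.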

Quick  exercise  22.1  Suppose  you  are  given  a  piece  of  Australian  timber  with 
density  65.  What  would  you  choose  as  an  estimate  for  the  Janka  hardness? 


Fig. 22.2. Scatterplot and estimated regression line for the timber data.

In general, writing $\sum$ instead of $\sum_{i=1}^{n}$, we find the following formulas for the
estimates $\hat\alpha$ (the intercept) and $\hat\beta$ (the slope):

\[ \hat\beta = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2} \tag{22.1} \]

\[ \hat\alpha = \bar{y}_n - \hat\beta \bar{x}_n. \tag{22.2} \]

Since $S(\alpha, \beta)$ is an elliptic paraboloid (a “vase”), it follows that $(\hat\alpha, \hat\beta)$ is the
unique minimum of $S(\alpha, \beta)$ (except when all $x_i$ are equal).

Quick exercise 22.2 Check that the line $y = \hat\alpha + \hat\beta x$ always passes through
the “center of gravity” $(\bar{x}_n, \bar{y}_n)$.


Least  squares  estimators  are  unbiased 

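We denote the least squares estimates by $\hat\alpha$ and $\hat\beta$. It is quite common to also
denote the least squares estimators by $\hat\alpha$ and $\hat\beta$:

\[ \hat\alpha = \bar{Y}_n - \hat\beta \bar{x}_n, \qquad \hat\beta = \frac{n \sum x_i Y_i - (\sum x_i)(\sum Y_i)}{n \sum x_i^2 - (\sum x_i)^2}. \]

In Exercise 22.12 it is shown that $\hat\beta$ is an unbiased estimator for $\beta$. Using this
and the fact that $E[Y_i] = \alpha + \beta x_i$ (see page 258), we find for $\hat\alpha$:

\[ \begin{aligned} E[\hat\alpha] &= E[\bar{Y}_n] - \bar{x}_n E[\hat\beta] = \frac{1}{n} \sum_{i=1}^{n} E[Y_i] - \bar{x}_n \beta \\ &= \frac{1}{n} \sum_{i=1}^{n} (\alpha + \beta x_i) - \bar{x}_n \beta = \alpha + \beta \bar{x}_n - \bar{x}_n \beta \\ &= \alpha. \end{aligned} \]

We see that $\hat\alpha$ is an unbiased estimator for $\alpha$.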
An unbiased estimator for $\sigma^2$

In the simple linear regression model the assumptions imply that the random
variables $Y_i$ are independent with variance $\sigma^2$. Unfortunately, one cannot
apply the usual estimator $\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2$ for the variance of the
$Y_i$ (see Section 19.4), because different $Y_i$ have different expectations. What
would be a reasonable estimator for $\sigma^2$? The following quick exercise suggests
a candidate.

Quick exercise 22.3 Let $U_1, U_2, \ldots, U_n$ be independent random variables,
each with expected value zero and variance $\sigma^2$. Show that

\[ T = \frac{1}{n} \sum_{i=1}^{n} U_i^2 \]

is an unbiased estimator for $\sigma^2$.

At first sight one might be tempted to think that the unbiased estimator $T$
from this quick exercise is a useful tool to estimate $\sigma^2$. Unfortunately, we only
observe the $x_i$ and $Y_i$, not the $U_i$. However, from the fact that $U_i = Y_i - \alpha - \beta x_i$,
it seems reasonable to try

\[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat\alpha - \hat\beta x_i)^2 \tag{22.3} \]

as an estimator for $\sigma^2$. Tedious calculations show that the expected value of
this random variable equals $\frac{n-2}{n} \sigma^2$. But then we can easily turn it into an
unbiased estimator for $\sigma^2$.

An unbiased estimator for $\sigma^2$. In the simple linear regression
model the random variable

\[ \hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat\alpha - \hat\beta x_i)^2 \]

is an unbiased estimator for $\sigma^2$.
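As an illustration of how this estimator is evaluated on data, here is a minimal sketch. The function names `fit_line` and `sigma2_hat` and the four data points are hypothetical, chosen only to keep the arithmetic easy to follow; the slope and intercept are computed with the least squares formulas of this section.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for the line y = alpha + beta * x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    alpha = sy / n - beta * sx / n
    return alpha, beta


def sigma2_hat(xs, ys):
    """Residual sum of squares divided by n - 2 (the unbiased variance estimate)."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points")
    alpha, beta = fit_line(xs, ys)
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    return rss / (n - 2)


# Hypothetical data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 4.0]
print(sigma2_hat(xs, ys))
```

Dividing by n - 2 rather than n reflects that two parameters, the intercept and the slope, were estimated from the same data.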


22.2  Residuals 

A way to explore whether the simple linear regression model is appropriate
to model a given bivariate dataset is to inspect a scatterplot of the so-called
residuals $r_i$ against the $x_i$. The $i$th residual is defined as the vertical distance
between the $i$th point and the estimated regression line:

\[ r_i = y_i - \hat\alpha - \hat\beta x_i, \qquad i = 1, 2, \ldots, n. \]


When a linear model is appropriate, the scatterplot of the residuals against
the $x_i$ should show truly random fluctuations around zero, in the sense that
it should not exhibit any trend or pattern. This seems to be the case in
Figure 22.3, which shows the residuals for the black cherry tree data from
Exercise 17.9.
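Residuals are straightforward to compute once the line has been fitted. The sketch below is an illustration under stated assumptions: the function name `residuals` and the data are hypothetical, and the slope is computed in the centered form, which is algebraically the same as the least squares slope of Section 22.1.

```python
def residuals(xs, ys):
    """Residuals r_i = y_i - alpha_hat - beta_hat * x_i of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centered form of the least squares slope.
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return [y - alpha - beta * x for x, y in zip(xs, ys)]


# Hypothetical data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 4.0]
r = residuals(xs, ys)
print([round(v, 3) for v in r])
print(abs(sum(r)) < 1e-9)  # the residuals of a fitted line with intercept sum to zero
```

Plotting the pairs (x_i, r_i) produced this way gives the kind of residual scatterplot discussed in this section.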
Fig. 22.3. Scatterplot of $r_i$ versus $x_i$ for the black cherry tree data.


Quick exercise 22.4 Recall from Quick exercise 22.2 that $(\bar{x}_n, \bar{y}_n)$ is on the
regression line $y = \hat\alpha + \hat\beta x$, i.e., that $\bar{y}_n = \hat\alpha + \hat\beta \bar{x}_n$. Use this to show that
$\sum_{i=1}^{n} r_i = 0$, i.e., the sum of the residuals is zero.

In Figure 22.4 we depicted $r_i$ versus $x_i$ for the timber dataset. In this case a
slight parabolic pattern can be observed. Figures 22.2 and 22.4 suggest that
for the timber dataset a better model might be

\[ Y_i = \alpha + \beta x_i + \gamma x_i^2 + U_i \quad \text{for } i = 1, 2, \ldots, n. \]

Fig. 22.4. Scatterplot of $r_i$ versus $x_i$ for the timber data with the simple linear
regression model $Y_i = \alpha + \beta x_i + U_i$.

In this new model the residuals are

\[ r_i = y_i - \hat\alpha - \hat\beta x_i - \hat\gamma x_i^2, \]

where $\hat\alpha$, $\hat\beta$, and $\hat\gamma$ are the least squares estimates obtained by minimizing

\[ \sum_{i=1}^{n} (y_i - \alpha - \beta x_i - \gamma x_i^2)^2. \]

In Figure 22.5 we depicted $r_i$ versus $x_i$. The residuals display no trend or
pattern, except that they “fan out”, an example of a phenomenon called
heteroscedasticity.

Fig. 22.5. Scatterplot of $r_i$ versus $x_i$ for the timber data with the model
$Y_i = \alpha + \beta x_i + \gamma x_i^2 + U_i$.
Heteroscedasticity

The assumption of equal variance of the $U_i$ (and therefore of the $Y_i$) is called
homoscedasticity. In case the variance of $Y_i$ depends on the value of $x_i$, we
speak of heteroscedasticity. For instance, heteroscedasticity occurs when $Y_i$
with a large expected value have a larger variance than those with small
expected values. This produces a “fanning out” effect, which can be observed
in Figure 22.5. This figure strongly suggests that the timber data are
heteroscedastic. Possible ways out of this problem are a technique called weighted
least squares or the use of variance-stabilizing transformations.


22.3  Relation  with  maximum  likelihood 

To apply the method of least squares no assumption is needed about the type
of distribution of the $U_i$. In case the type of distribution of the $U_i$ is known,
the maximum likelihood principle can be applied. Consider, for instance, the
classical situation where the $U_i$ are independent with an $N(0, \sigma^2)$ distribution.
What are the maximum likelihood estimates for $\alpha$ and $\beta$?

In this case the $Y_i$ are independent, and $Y_i$ has an $N(\alpha + \beta x_i, \sigma^2)$ distribution.
Under these assumptions and assuming that the linear model is appropriate
to model a given bivariate dataset, the $r_i$ should look like the realization of a
random sample from a normal distribution. As an example a histogram of the
residuals of the cherry tree data of Exercise 17.9 is depicted in Figure 22.6.

Fig. 22.6. Histogram of the residuals $r_i$ for the black cherry tree data.


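The data do not exhibit strong evidence against the assumption of normality.
When $Y_i$ has an $N(\alpha + \beta x_i, \sigma^2)$ distribution, the probability density of $Y_i$ is
given by

\[ f_i(y) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(y - \alpha - \beta x_i)^2/(2\sigma^2)} \quad \text{for } -\infty < y < \infty. \]

Since

\[ \ln(f_i(y_i)) = -\ln(\sigma) - \ln(\sqrt{2\pi}) - \frac{1}{2} \left( \frac{y_i - \alpha - \beta x_i}{\sigma} \right)^2, \]

the loglikelihood is:

\[ \begin{aligned} \ell(\alpha, \beta, \sigma) &= \ln(f_1(y_1)) + \cdots + \ln(f_n(y_n)) \\ &= -n \ln(\sigma) - n \ln(\sqrt{2\pi}) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. \end{aligned} \]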
Note that for any fixed $\sigma > 0$, the loglikelihood $\ell(\alpha, \beta, \sigma)$ attains its maximum
precisely when $\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$ is minimal. Hence, in case the $U_i$ are
independent with an $N(0, \sigma^2)$ distribution, the maximum likelihood principle
and the least squares method yield the same estimators for $\alpha$ and $\beta$.

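To find the maximum likelihood estimate of $\sigma$ we differentiate $\ell(\alpha, \beta, \sigma)$ with
respect to $\sigma$:

\[ \frac{\partial}{\partial \sigma} \ell(\alpha, \beta, \sigma) = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. \]

It follows (from the invariance principle on page 321) that the maximum
likelihood estimator of $\sigma^2$ is given by

\[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat\alpha - \hat\beta x_i)^2, \]

which is the estimator from (22.3).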
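The step from the derivative to the estimator can be spelled out as follows. This is a sketch of the omitted algebra in the notation of this section, with the maximum likelihood estimates of $\alpha$ and $\beta$ (which coincide with the least squares estimates) already substituted.

```latex
\begin{aligned}
\frac{\partial}{\partial\sigma}\,\ell(\hat\alpha,\hat\beta,\sigma) = 0
  &\iff -\frac{n}{\sigma} + \frac{1}{\sigma^{3}} \sum_{i=1}^{n} (y_i - \hat\alpha - \hat\beta x_i)^2 = 0 \\
  &\iff \sigma^{2} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat\alpha - \hat\beta x_i)^2 .
\end{aligned}
```

The derivative is positive for smaller $\sigma$ and negative for larger $\sigma$, so this stationary point is a maximum. Comparing the two definitions shows that the resulting estimator is $(n-2)/n$ times the unbiased estimator $\hat\sigma^2$ of Section 22.1, so the two differ only by a factor close to 1 when $n$ is large.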


22.4  Solutions  to  the  quick  exercises 


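22.1 We can use the estimated regression line $y = -1160.5 + 57.51x$ to predict
the Janka hardness. For density $x = 65$ we find as a prediction for the Janka
hardness $y = 2577.65$.


22.2 Rewriting $\hat\alpha = \bar{y}_n - \hat\beta \bar{x}_n$, it follows that $\bar{y}_n = \hat\alpha + \hat\beta \bar{x}_n$, which means that
$(\bar{x}_n, \bar{y}_n)$ is a point on the estimated regression line $y = \hat\alpha + \hat\beta x$.


22.3 We need to show that $E[T] = \sigma^2$. Since $E[U_i] = 0$, $Var(U_i) = E[U_i^2]$,
so that:

\[ E[T] = E\left[ \frac{1}{n} \sum_{i=1}^{n} U_i^2 \right] = \frac{1}{n} \sum_{i=1}^{n} E[U_i^2] = \frac{1}{n} \sum_{i=1}^{n} Var(U_i) = \sigma^2. \]


22.4 Since $r_i = y_i - (\hat\alpha + \hat\beta x_i)$ for $i = 1, 2, \ldots, n$, it follows that the sum of
the residuals equals

\[ \begin{aligned} \sum r_i &= \sum y_i - \left( n\hat\alpha + \hat\beta \sum x_i \right) \\ &= n\bar{y}_n - (n\hat\alpha + n\hat\beta \bar{x}_n) = n \left( \bar{y}_n - (\hat\alpha + \hat\beta \bar{x}_n) \right) = 0, \end{aligned} \]

because $\bar{y}_n = \hat\alpha + \hat\beta \bar{x}_n$, according to Quick exercise 22.2.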

22.5 Exercises

22.1 ⊞ Consider the following bivariate dataset:

(1, 2) (3, 1.8) (5, 1).

a. Determine the least squares estimates $\hat\alpha$ and $\hat\beta$ of the parameters of the
regression line $y = \alpha + \beta x$.

b. Determine the residuals $r_1$, $r_2$, and $r_3$ and check that they add up to 0.

c. Draw in one figure the scatterplot of the data and the estimated regression
line $y = \hat\alpha + \hat\beta x$.

22.2 Adding one point may dramatically change the estimates of $\alpha$ and $\beta$.
Suppose one extra datapoint is added to the dataset of the previous exercise
and that we have as dataset:

(0, 0) (1, 2) (3, 1.8) (5, 1).

Determine the least squares estimate of $\beta$. A point such as (0, 0), which
dramatically changes the estimates for $\alpha$ and $\beta$, is called a leverage point.

22.3 Suppose we have the following bivariate dataset:

(1, 3.1) (1.7, 3.9) (2.1, 3.8) (2.5, 4.7) (2.7, 4.5).

a. Determine the least squares estimates $\hat\alpha$ and $\hat\beta$ of the parameters of the
regression line $y = \alpha + \beta x$. You may use that $\sum x_i = 10$, $\sum y_i = 20$,
$\sum x_i^2 = 21.84$, and $\sum x_i y_i = 41.61$.

b. Draw in one figure the scatterplot of the data and the estimated regression
line $y = \hat\alpha + \hat\beta x$.

22.4 We are given a bivariate dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_{100}, y_{100})$. For
this bivariate dataset it is known that $\sum x_i = 231.7$, $\sum x_i^2 = 2400.8$, $\sum y_i = 321$,
and $\sum x_i y_i = 5189$. What are the least squares estimates $\hat\alpha$ and $\hat\beta$ of the
parameters of the regression line $y = \alpha + \beta x$?

22.5 ⊞ For the timber dataset it seems reasonable to leave out the intercept $\alpha$
(“no hardness without density”). The model then becomes

\[ Y_i = \beta x_i + U_i \quad \text{for } i = 1, 2, \ldots, n. \]

Show that the least squares estimator $\hat\beta$ of $\beta$ is now given by

\[ \hat\beta = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2} \]

by minimizing the appropriate sum of squares.
22.6 □ (Quick exercise 22.1 and Exercise 22.5 continued). Suppose we are
given a piece of Australian timber with density 65. What would you choose
as an estimate for the Janka hardness, based on the regression model with
no intercept? Recall that $\sum x_i y_i = 2\,790\,525$ and $\sum x_i^2 = 81\,750.02$ (see also
Section 22.1).

22.7 Consider the dataset

\[ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), \]

where $x_1, x_2, \ldots, x_n$ are nonrandom and $y_1, y_2, \ldots, y_n$ are realizations of
random variables $Y_1, Y_2, \ldots, Y_n$, satisfying

y,  =  -h  Ci  for  i  =  l,2,...,n. 

Here $U_1, U_2, \ldots, U_n$ are independent random variables with zero expectation
and variance $\sigma^2$. What are the least squares estimates for the parameters $\alpha$
and $\beta$ in this model?

22.8 □ Which simple regression model has the larger residual sum of squares
$\sum_{i=1}^{n} r_i^2$: the model with intercept or the one without?

22.9 For some datasets it seems reasonable to leave out the slope $\beta$. For
example, in the jury example from Section 6.3 it was assumed that the score
that juror $i$ assigns when the performance deserves a score $g$ is $Y_i = g + Z_i$,
where $Z_i$ is a random variable with values around zero. In general, when the
slope $\beta$ is left out, the model becomes

\[ Y_i = \alpha + U_i \quad \text{for } i = 1, 2, \ldots, n. \]

Show that $\bar{Y}_n$ is the least squares estimator $\hat\alpha$ of $\alpha$.

22.10 □ In the method of least squares we choose $\alpha$ and $\beta$ in such a way
that the sum of squared residuals $S(\alpha, \beta)$ is minimal. Since the $i$th term in
this sum is the squared vertical distance from $(x_i, y_i)$ to the regression line
$y = \alpha + \beta x$, one might also wonder whether it is a good idea to replace this
squared distance simply by the distance. So, given a bivariate dataset

\[ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), \]

choose $\alpha$ and $\beta$ in such a way that the sum

\[ A(\alpha, \beta) = \sum_{i=1}^{n} |y_i - \alpha - \beta x_i| \]

is minimal. We will investigate this by a simple example. Consider the
following bivariate dataset:

(0, 2), (1, 2), (2, 0).

a. Determine the least squares estimates $\hat\alpha$ and $\hat\beta$, and draw in one figure
the scatterplot of the data and the estimated regression line $y = \hat\alpha + \hat\beta x$.
Finally, determine $A(\hat\alpha, \hat\beta)$.

b. One might wonder whether $\hat\alpha$ and $\hat\beta$ also minimize $A(\alpha, \beta)$. To investigate
this, choose $\beta = -1$ and find $\alpha$'s for which $A(\alpha, -1) < A(\hat\alpha, \hat\beta)$. For which
$\alpha$ is $A(\alpha, -1)$ minimal?

c. Find $\alpha$ and $\beta$ for which $A(\alpha, \beta)$ is minimal.

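22.11 Consider the dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where the $x_i$ are
nonrandom and the $y_i$ are realizations of random variables $Y_1, Y_2, \ldots, Y_n$
satisfying

\[ Y_i = g(x_i) + U_i \quad \text{for } i = 1, 2, \ldots, n, \]

where $U_1, U_2, \ldots, U_n$ are independent random variables with zero
expectation and variance $\sigma^2$. Visual inspection of the scatterplot of our dataset in
Figure 22.7 suggests that we should model the $Y_i$ by

\[ Y_i = \beta x_i + \gamma x_i^2 + U_i \quad \text{for } i = 1, 2, \ldots, n. \]

Fig. 22.7. Scatterplot of $y_i$ versus $x_i$.

a. Show that the least squares estimators $\hat\beta$ and $\hat\gamma$ satisfy

\[ \begin{aligned} \hat\beta \sum x_i^2 + \hat\gamma \sum x_i^3 &= \sum x_i Y_i \\ \hat\beta \sum x_i^3 + \hat\gamma \sum x_i^4 &= \sum x_i^2 Y_i. \end{aligned} \]

b. Infer from a (for instance, by using linear algebra) that the estimators
$\hat\beta$ and $\hat\gamma$ are given by

\[ \hat\beta = \frac{(\sum x_i Y_i)(\sum x_i^4) - (\sum x_i^3)(\sum x_i^2 Y_i)}{(\sum x_i^2)(\sum x_i^4) - (\sum x_i^3)^2}, \qquad \hat\gamma = \frac{(\sum x_i^2)(\sum x_i^2 Y_i) - (\sum x_i^3)(\sum x_i Y_i)}{(\sum x_i^2)(\sum x_i^4) - (\sum x_i^3)^2}. \]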
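22.12 ⊞ The least squares estimator $\hat\beta$ from (22.1) is an unbiased estimator
for $\beta$. You can show this in four steps.

a. First show that

\[ E[\hat\beta] = \frac{n \sum x_i E[Y_i] - (\sum x_i)(\sum E[Y_i])}{n \sum x_i^2 - (\sum x_i)^2}. \]

b. Next use that $E[Y_i] = \alpha + \beta x_i$, to obtain that

\[ E[\hat\beta] = \frac{n \sum x_i (\alpha + \beta x_i) - (\sum x_i)\left[ n\alpha + \beta \sum x_i \right]}{n \sum x_i^2 - (\sum x_i)^2}. \]

c. Simplify this last expression to find

\[ E[\hat\beta] = \frac{n\alpha \sum x_i + n\beta \sum x_i^2 - n\alpha \sum x_i - \beta (\sum x_i)^2}{n \sum x_i^2 - (\sum x_i)^2}. \]

d. Finally, conclude that $\hat\beta$ is an unbiased estimator for $\beta$.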
Confidence intervals for the mean


Sometimes,  a  range  of  plausible  values  for  an  unknown  parameter  is  preferred 
to  a  single  estimate.  We  shall  discuss  how  to  turn  data  into  what  are  called 
confidence  intervals  and  show  that  this  can  be  done  in  such  a  manner  that 
definite  statements  can  be  made  about  how  confident  we  are  that  the  true  pa¬ 
rameter  value  is  in  the  reported  interval.  This  level  of  confidence  is  something 
you  can  choose.  We  start  this  chapter  with  the  general  principle  of  confidence 
intervals.  We  continue  with  confidence  intervals  for  the  mean,  the  common 
way  to  refer  to  confidence  intervals  made  for  the  expected  value  of  the  model 
distribution.  Depending  on  the  situation,  one  of  the  four  methods  presented 
will  apply. 


23.1  General  principle 

In previous chapters we have encountered sample statistics as estimators for
distribution features. This started somewhat informally in Chapter 17, where
it was claimed, for example, that the sample mean and the sample variance
are usually close to $\mu$ and $\sigma^2$ of the underlying distribution. Bias and MSE
of estimators, discussed in Chapters 19 and 20, are used to judge the quality
of estimators. If we have at our disposal an estimator $T$ for an unknown
parameter $\theta$, we use its realization $t$ as our estimate for $\theta$. For example, when
collecting data on the speed of light, as Michelson did (see Section 13.1), the
unknown speed of light would be the parameter $\theta$, our estimator $T$ could
be the sample mean, and Michelson’s data then yield an estimate $t$ for $\theta$ of
299 852.4 km/sec. We call this number a point estimate: if we are required
to select one number, this is it. Had the measurements started a day earlier,
however, the whole experiment would in essence be the same, but the results
might have been different. Hence, we cannot say that the estimate equals the
speed of light but rather that it is close to the true speed of light. For example,
we could say something like: “we have great confidence that the true speed of
light is somewhere between ... and ... .” In addition to providing an interval
of plausible values for $\theta$ we would want to add a specific statement about how
confident we are that the true $\theta$ is among them.

In this chapter we shall present methods to make confidence statements about
unknown parameters, based on knowledge of the sampling distributions of
corresponding estimators. To illustrate the main idea, suppose the estimator $T$
is unbiased for the speed of light $\theta$. For the moment, also suppose that $T$
has standard deviation $\sigma_T = 100$ km/sec (we shall drop this unrealistic
assumption shortly). Then, applying formula (13.1), which was derived from
Chebyshev’s inequality (see Section 13.2), we find

\[ P(|T - \theta| < 2\sigma_T) \ge \tfrac{3}{4}. \tag{23.1} \]

In words this reads: with probability at least 75%, the estimator $T$ is within
$2\sigma_T = 200$ of the true speed of light $\theta$. We could rephrase this as

\[ T \in (\theta - 200, \theta + 200) \quad \text{with probability at least 75\%.} \]

However, if I am near the city of Paris, then the city of Paris is near me: the
statement “$T$ is within 200 of $\theta$” is the same as “$\theta$ is within 200 of $T$,” and
we could equally well rephrase (23.1) as

\[ \theta \in (T - 200, T + 200) \quad \text{with probability at least 75\%.} \]

Note that of the last two equations the first is a statement about a random
variable $T$ being in a fixed interval, whereas in the second equation the interval
is random and the statement is about the probability that the random interval
covers the fixed but unknown $\theta$. The interval $(T - 200, T + 200)$ is sometimes
called an interval estimator, and its realization is an interval estimate.

Evaluating $T$ for the Michelson data we find as its realization $t = 299\,852.4$,
and this yields the statement

\[ \theta \in (299\,652.4,\ 300\,052.4). \tag{23.2} \]

Because we substituted the realization for the random variable, we cannot
claim that (23.2) holds with probability at least 75%: either the true speed of
light $\theta$ belongs to the interval or it does not; the statement we make is either
true or false, we just do not know which. However, because the procedure
guarantees a probability of at least 75% of getting a “right” statement, we
say:

\[ \theta \in (299\,652.4,\ 300\,052.4) \quad \text{with confidence at least 75\%.} \tag{23.3} \]

The  construction  of  this  confidence  interval  only  involved  an  unbiased  estima¬ 
tor  and  knowledge  of  its  standard  deviation.  When  more  information  on  the 
sampling  distribution  of  the  estimator  is  available,  more  refined  statements 
can  be  made,  as  we  shall  see  shortly. 
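The construction above can be written out in a few lines of code. The following is a minimal sketch, not part of the original text: the function name `chebyshev_interval` is hypothetical, and it assumes, as in the example, an unbiased estimator with known standard deviation, for which Chebyshev's inequality guarantees confidence at least 1 - 1/k^2 for the interval of half-width k standard deviations.

```python
def chebyshev_interval(t, sigma_t, k):
    """Return the interval (t - k*sigma_t, t + k*sigma_t) and its guaranteed
    minimum confidence 1 - 1/k**2 from Chebyshev's inequality."""
    if k <= 1:
        raise ValueError("k must exceed 1 for a nontrivial bound")
    interval = (t - k * sigma_t, t + k * sigma_t)
    min_confidence = 1 - 1 / k ** 2
    return interval, min_confidence


# Values from the example in the text: t = 299852.4, sigma_T = 100, k = 2.
interval, level = chebyshev_interval(299852.4, 100, 2)
print(interval, level)
```

The bound uses nothing about the shape of the sampling distribution, which is why the stated confidence is only a lower bound.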
Quick exercise 23.1 Repeat the preceding derivation, starting from the
statement $P(|T - \theta| < 3\sigma_T) \ge 8/9$ (check that this follows from Chebyshev’s
inequality). What is the resulting confidence interval for the speed of light,
and what is the corresponding confidence?

A  general  definition 

Many confidence intervals are of the form¹

\[ (t - c \cdot \sigma_T,\ t + c \cdot \sigma_T) \]

we just encountered, where $c$ is a number near 2 or 3. The corresponding
confidence  is  often  much  higher  than  in  the  preceding  example.  Because  there 
are  many  other  ways  confidence  intervals  can  (or  have  to)  be  constructed,  the 
general  definition  looks  a  bit  different. 


Confidence intervals. Suppose a dataset $x_1, \ldots, x_n$ is given,
modeled as realization of random variables $X_1, \ldots, X_n$. Let $\theta$ be the
parameter of interest, and $\gamma$ a number between 0 and 1. If there exist
sample statistics $L_n = g(X_1, \ldots, X_n)$ and $U_n = h(X_1, \ldots, X_n)$ such
that

\[ P(L_n < \theta < U_n) = \gamma \]

for every value of $\theta$, then

\[ (l_n, u_n), \]

where $l_n = g(x_1, \ldots, x_n)$ and $u_n = h(x_1, \ldots, x_n)$, is called a $100\gamma\%$
confidence interval for $\theta$. The number $\gamma$ is called the confidence level.


Sometimes sample statistics $L_n$ and $U_n$ as required in the definition do not
exist, but one can find $L_n$ and $U_n$ that satisfy

\[ P(L_n < \theta < U_n) \ge \gamma. \]

The resulting confidence interval $(l_n, u_n)$ is called a conservative
$100\gamma\%$ confidence interval for $\theta$: the actual confidence level might
be higher. For example, the interval in (23.2) is a conservative 75% confidence
interval.
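As a small illustration of a conservative interval, the Chebyshev construction can be sketched in Python. The function name `chebyshev_interval` is an assumption made for this example; the inputs are the speed-of-light estimate $t = 299\,852.4$ and $\sigma_T = 100$ used in this chapter. The returned confidence is only a lower bound, which is exactly what makes the interval conservative.

```python
def chebyshev_interval(t, sd, k):
    """Return (lower, upper, min_confidence) for the interval t +/- k*sd.

    Chebyshev's inequality gives P(|T - theta| < k*sd) >= 1 - 1/k**2 for an
    unbiased estimator T with standard deviation sd, whatever its distribution.
    """
    if k <= 1:
        raise ValueError("k must exceed 1 for a non-trivial bound")
    return t - k * sd, t + k * sd, 1 - 1 / k**2


# Speed-of-light values quoted in the text: estimate t and sigma_T = 100.
lower, upper, conf = chebyshev_interval(299852.4, 100.0, 2)
print(f"({lower:.1f}, {upper:.1f}) with confidence at least {conf:.0%}")
```

Calling the function with `k = 3` gives the wider interval and the higher guaranteed confidence asked for in Quick exercise 23.1.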

Quick exercise 23.2 Why is the interval in (23.2) a conservative 75% confidence
interval?

There is no way of knowing whether an individual confidence interval is correct,
in the sense that it indeed does cover $\theta$. The procedure guarantees that
each time we make a confidence interval we have probability $\gamma$ of covering
$\theta$. What this means in practice can easily be illustrated with an example,
using simulation:

Generate $x_1,\ldots,x_{20}$ from an $N(0,1)$ distribution. Next, pretend that
it is known that the data are from a normal distribution but that both
$\mu$ and $\sigma$ are unknown. Construct the 90% confidence interval for the
expectation $\mu$ using the method described in the next section, which
says to use $(l_n, u_n)$ with

\[ l_n = \bar{x}_{20} - 1.729\,\frac{s_{20}}{\sqrt{20}}, \qquad u_n = \bar{x}_{20} + 1.729\,\frac{s_{20}}{\sqrt{20}}, \]

where $\bar{x}_{20}$ and $s_{20}$ are the sample mean and standard deviation.
Finally, check whether the “true $\mu$,” in this case 0, is in the confidence
interval.

¹ Another form is, for example, $(c_1 t,\ c_2 t)$.


We repeated the whole procedure 50 times, making 50 confidence intervals
for $\mu$. Each confidence interval is based on a fresh independently generated
set of data. The 50 intervals are plotted in Figure 23.1 as horizontal line
segments, and at $\mu$ (0!) a vertical line is drawn. We count 46 “hits”: only
four intervals do not contain the true $\mu$.

Fig. 23.1. Fifty 90% confidence intervals for $\mu = 0$.
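The experiment is easy to repeat. The following Python sketch follows the recipe described above (20 draws from $N(0,1)$, critical value 1.729); the function names and the seed are assumptions chosen for illustration, so the number of hits will generally differ from the 46 reported for Figure 23.1.

```python
import math
import random
import statistics

T_19_005 = 1.729  # critical value t_{19,0.05} quoted in the text


def covers_zero(rng, n=20):
    """Draw n values from N(0,1) and report whether the 90% interval covers 0."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = statistics.mean(xs)
    half_width = T_19_005 * statistics.stdev(xs) / math.sqrt(n)
    return mean - half_width < 0.0 < mean + half_width


def count_hits(repetitions, seed):
    """Count how many of `repetitions` independent intervals contain mu = 0."""
    rng = random.Random(seed)
    return sum(covers_zero(rng) for _ in range(repetitions))


hits = count_hits(50, seed=1)  # the seed is an arbitrary choice
print(f"{hits} of 50 intervals cover the true mean")
```

With many more repetitions the fraction of hits should settle near 0.90, which is what the confidence level promises about the procedure rather than about any single interval.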

Quick  exercise  23.3  Suppose  you  were  to  make  40  confidence  intervals  with 
confidence  level  95%.  About  how  many  of  them  should  you  expect  to  be 
“wrong”?  Should  you  be  surprised  if  10  of  them  are  wrong? 

In the remainder of this chapter we consider confidence intervals for the mean:
confidence intervals for the unknown expectation $\mu$ of the distribution from
which the sample originates. We start with the situation where it is known that
the data originate from a normal distribution, first with known variance, then
with unknown variance. Then we drop the normal assumption, first use the
bootstrap, and finally show how, for very large samples, confidence intervals
based on the central limit theorem are made.


23.2  Normal  data 

Suppose the data can be seen as the realization of a sample $X_1,\ldots,X_n$ from
an $N(\mu,\sigma^2)$ distribution and $\mu$ is the (unknown) parameter of
interest. If the variance $\sigma^2$ is known, confidence intervals are easily
derived. Before we do this, some preparation has to be done.

Critical  values 

We shall need so-called critical values for the standard normal distribution.
The critical value $z_p$ of an $N(0,1)$ distribution is the number that has right
tail probability $p$. It is defined by

\[ P(Z \ge z_p) = p, \]

where $Z$ is an $N(0,1)$ random variable. For example, from Table B.1 we read
$P(Z \ge 1.96) = 0.025$, so $z_{0.025} = 1.96$. In fact, $z_p$ is the $(1-p)$th
quantile of the standard normal distribution:

\[ \Phi(z_p) = P(Z \le z_p) = 1 - p. \]

By the symmetry of the standard normal density, $P(Z \le -z_p) = P(Z \ge z_p) = p$,
so $P(Z \ge -z_p) = 1 - p$ and therefore

\[ z_{1-p} = -z_p. \]

For example, $z_{0.975} = -z_{0.025} = -1.96$. All this is illustrated in Figure 23.2.
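When no table is at hand, critical values can be computed from the quantile function. A minimal Python sketch, assuming only the standard library class `statistics.NormalDist`; the helper name `z_critical` is introduced here for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_critical(p):
    """Return z_p, the value with right tail probability p under N(0,1)."""
    # z_p is the (1 - p)th quantile, so invert the distribution function there.
    return STANDARD_NORMAL.inv_cdf(1.0 - p)


for p in (0.025, 0.975):
    print(f"z_{p} = {z_critical(p):.3f}")
```

The two printed values differ only in sign, in line with the symmetry relation $z_{1-p} = -z_p$.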

Quick exercise 23.4 Determine $z_{0.01}$ and $z_{0.95}$ from Table B.1.

Fig. 23.2. Critical values of the standard normal distribution.


Variance  known 

If $X_1,\ldots,X_n$ is a random sample from an $N(\mu,\sigma^2)$ distribution, then
$\bar{X}_n$ has an $N(\mu,\sigma^2/n)$ distribution, and from the properties of the
normal distribution (see page 106), we know that

\[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \quad \text{has an } N(0,1) \text{ distribution.} \]

If $c_l$ and $c_u$ are chosen such that $P(c_l < Z < c_u) = \gamma$ for an $N(0,1)$
distributed random variable $Z$, then

\begin{align*}
\gamma &= P\left(c_l < \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} < c_u\right) \\
       &= P\left(c_l\frac{\sigma}{\sqrt{n}} < \bar{X}_n - \mu < c_u\frac{\sigma}{\sqrt{n}}\right) \\
       &= P\left(\bar{X}_n - c_u\frac{\sigma}{\sqrt{n}} < \mu < \bar{X}_n - c_l\frac{\sigma}{\sqrt{n}}\right).
\end{align*}

We have found that

\[ L_n = \bar{X}_n - c_u\frac{\sigma}{\sqrt{n}} \quad\text{and}\quad U_n = \bar{X}_n - c_l\frac{\sigma}{\sqrt{n}} \]

satisfy the confidence interval definition: the interval $(L_n, U_n)$ covers $\mu$
with probability $\gamma$. Therefore

\[ \left(\bar{x}_n - c_u\frac{\sigma}{\sqrt{n}},\ \bar{x}_n - c_l\frac{\sigma}{\sqrt{n}}\right) \]

is a $100\gamma\%$ confidence interval for $\mu$. A common choice is to divide
$\alpha = 1 - \gamma$ evenly between the tails,² that is, solve $c_l$ and $c_u$ from

\[ P(Z \ge c_u) = \alpha/2 \quad\text{and}\quad P(Z \le c_l) = \alpha/2, \]

so that $c_u = z_{\alpha/2}$ and $c_l = z_{1-\alpha/2} = -z_{\alpha/2}$. Summarizing,
the $100(1-\alpha)\%$ confidence interval for $\mu$ is:

\[ \left(\bar{x}_n - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x}_n + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right). \]

For example, if $\alpha = 0.05$, we use $z_{0.025} = 1.96$ and the 95% confidence
interval is

\[ \left(\bar{x}_n - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x}_n + 1.96\frac{\sigma}{\sqrt{n}}\right). \]

² Here this choice could be motivated by the fact that it leads to the shortest
confidence interval; in other examples the shortest interval requires an asymmetric
division of $\alpha$. If you are only concerned with the left or right boundary of
the confidence interval, see the next chapter.

Example: gross calorific content of coal

When a shipment of coal is traded, a number of its properties should be known
accurately, because the value of the shipment is determined by them. An important
example is the so-called gross calorific value, which characterizes the
heat content and is a numerical value in megajoules per kilogram (MJ/kg).
The International Organization for Standardization (ISO) issues standard
procedures for the determination of these properties. For the gross calorific
value, there is a method known as ISO 1928. When the procedure is carried out
properly, resulting measurement errors are known to be approximately normal,
with a standard deviation of about 0.1 MJ/kg. Laboratories that operate
according to standard procedures receive ISO certificates. In Table 23.1, a
number of such ISO 1928 measurements are given for a shipment of Osterfeld
coal coded 262DE27.

Table 23.1. Gross calorific value measurements for Osterfeld 262DE27.

23.870  23.730  23.712  23.760  23.640  23.850  23.840  23.860
23.940  23.830  23.877  23.700  23.796  23.727  23.778  23.740
23.890  23.780  23.678  23.771  23.860  23.690  23.800

Source: A.M.H. van der Veen and A.J.M. Broos. Interlaboratory study programme
“ILS coal characterization” - reported data. Technical report, NMi
Van Swinden Laboratorium B.V., The Netherlands, 1996.

We want to combine these values into a confidence statement about the “true”
gross calorific content of Osterfeld 262DE27. From the data, we compute
$\bar{x}_n = 23.788$. Using the given $\sigma = 0.1$ and $\alpha = 0.05$, we find
the 95% confidence interval

\[ \left(23.788 - 1.96\frac{0.1}{\sqrt{23}},\ 23.788 + 1.96\frac{0.1}{\sqrt{23}}\right) = (23.747,\ 23.829) \text{ MJ/kg}. \]

Variance unknown


When $\sigma$ is unknown, the fact that

\[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \]

has a standard normal distribution has become useless, as it involves this
unknown $\sigma$, which would subsequently appear in the confidence interval.
However, if we substitute the estimator $S_n$ for $\sigma$, the resulting random
variable

\[ \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} \]

has a distribution that only depends on $n$ and not on $\mu$ or $\sigma$. Moreover,
its density can be given explicitly.

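Definition. A continuous random variable has a t-distribution with
parameter $m$, where $m \ge 1$ is an integer, if its probability density is
given by

\[ f(x) = k_m \left(1 + \frac{x^2}{m}\right)^{-\frac{m+1}{2}} \quad \text{for } -\infty < x < \infty, \]

where $k_m = \Gamma\left(\frac{m+1}{2}\right) \Big/ \left(\Gamma\left(\frac{m}{2}\right)\sqrt{m\pi}\right)$.
This distribution is denoted by $t(m)$ and is referred to as the t-distribution
with $m$ degrees of freedom.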

The normalizing constant $k_m$ is given in terms of the gamma function, which
was defined on page 157. For $m = 1$, it evaluates to $k_1 = 1/\pi$, and the
resulting density is that of the standard Cauchy distribution (see page 161).
If $X$ has a $t(m)$ distribution, then $E[X] = 0$ for $m \ge 2$ and
$\mathrm{Var}(X) = m/(m-2)$ for $m \ge 3$. Densities of t-distributions look like
that of the standard normal distribution: they are also symmetric around 0 and
bell-shaped. As $m$ goes to infinity the limit of the $t(m)$ density is the
standard normal density. The distinguishing feature is that densities of
t-distributions have heavier tails: $f(x)$ goes to zero as $x$ goes to $+\infty$
or $-\infty$, but more slowly than the density $\phi(x)$ of the standard normal
distribution. These properties are illustrated in Figure 23.3, which shows the
densities and distribution functions of the $t(1)$, $t(2)$, and $t(5)$
distribution as well as those of the standard normal.
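The heavier tails can also be checked numerically. The sketch below evaluates the $t(m)$ density from the definition with `math.gamma`; the function names and the evaluation point $x = 4$ are assumptions chosen for illustration.

```python
import math


def t_density(x, m):
    """Density of the t(m) distribution at x, using the formula in the definition."""
    k_m = math.gamma((m + 1) / 2) / (math.gamma(m / 2) * math.sqrt(m * math.pi))
    return k_m * (1 + x * x / m) ** (-(m + 1) / 2)


def normal_density(x):
    """Density of the standard normal distribution at x."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


# Tail comparison at x = 4 (an arbitrary illustrative point).
for m in (1, 2, 5, 50):
    print(f"t({m}) density at 4: {t_density(4.0, m):.6f}")
print(f"N(0,1) density at 4: {normal_density(4.0):.6f}")
```

The printed values decrease as $m$ grows and stay above the normal density, while near the center the $t(m)$ density with large $m$ is already close to $\phi$.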

We will also need critical values for the $t(m)$ distribution: the critical value
$t_{m,p}$ is the number satisfying

\[ P(T \ge t_{m,p}) = p, \]

where $T$ is a $t(m)$ distributed random variable. Because the t-distribution is
symmetric around zero, using the same reasoning as for the critical values of
the standard normal distribution, we find:

\[ t_{m,1-p} = -t_{m,p}. \]

For example, in Table B.2 we read $t_{10,0.01} = 2.764$, and from this we deduce
that $t_{10,0.99} = -2.764$.

Fig. 23.3. Three t-distributions and the standard normal distribution. The dotted
line corresponds to the standard normal. The other distributions depicted are the
$t(1)$, $t(2)$, and $t(5)$, which in that order resemble the standard normal more
and more.

Quick exercise 23.5 Determine $t_{3,0.01}$ and $t_{35,0.9975}$ from Table B.2.

We now return to the distribution of

\[ \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} \]

and construct a confidence interval for $\mu$.

The studentized mean of a normal random sample. For a
random sample $X_1,\ldots,X_n$ from an $N(\mu,\sigma^2)$ distribution, the
studentized mean

\[ \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} \]

has a $t(n-1)$ distribution, regardless of the values of $\mu$ and $\sigma$.

From this fact and using critical values of the t-distribution, we derive that

\[ P\left(-t_{n-1,\alpha/2} < \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < t_{n-1,\alpha/2}\right) = 1 - \alpha, \tag{23.4} \]

and in the same way as when $\sigma$ is known it now follows that a
$100(1-\alpha)\%$ confidence interval for $\mu$ is given by:

\[ \left(\bar{x}_n - t_{n-1,\alpha/2}\frac{s_n}{\sqrt{n}},\ \bar{x}_n + t_{n-1,\alpha/2}\frac{s_n}{\sqrt{n}}\right). \]

Returning to the coal example, there was another shipment, of Daw Mill
258GB41 coal, where there were actually some doubts whether the stated
accuracy of the ISO 1928 method was attained. We therefore prefer to consider
$\sigma$ unknown and estimate it from the data, which are given in Table 23.2.


Table 23.2. Gross calorific value measurements for Daw Mill 258GB41.

30.990  31.030  31.060  30.921  30.920  30.990  31.024  30.929
31.050  30.991  31.208  30.830  31.330  30.810  31.060  30.800
31.091  31.170  31.026  31.020  30.880  31.125

Source: A.M.H. van der Veen and A.J.M. Broos. Interlaboratory study programme
“ILS coal characterization” - reported data. Technical report, NMi
Van Swinden Laboratorium B.V., The Netherlands, 1996.


Doing this, we find $\bar{x}_n = 31.012$ and $s_n = 0.1294$. Because $n = 22$, for
a 95% confidence interval we use $t_{21,0.025} = 2.080$ and obtain

\[ \left(31.012 - 2.080\frac{0.1294}{\sqrt{22}},\ 31.012 + 2.080\frac{0.1294}{\sqrt{22}}\right) = (30.954,\ 31.069). \]

Note that this confidence interval is (about 40%!) wider than the one we made for
the Osterfeld coal, with almost the same sample size. There are two reasons
for this; one is that $\sigma = 0.1$ is replaced by the (larger) estimate
$s_n = 0.1294$, and the second is that the critical value $z_{0.025} = 1.96$ is
replaced by the larger $t_{21,0.025} = 2.080$. The differences in the method and
the ingredients seem minor, but they matter, especially for small samples.
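Both coal intervals can be recomputed directly from the measurements in Tables 23.1 and 23.2. The Python sketch below is one way to do so; the function names are assumptions made for this example, and the t critical value is taken from the text because the standard library has no t-quantile function.

```python
import math
import statistics
from statistics import NormalDist

osterfeld = [
    23.870, 23.730, 23.712, 23.760, 23.640, 23.850, 23.840, 23.860,
    23.940, 23.830, 23.877, 23.700, 23.796, 23.727, 23.778, 23.740,
    23.890, 23.780, 23.678, 23.771, 23.860, 23.690, 23.800,
]
daw_mill = [
    30.990, 31.030, 31.060, 30.921, 30.920, 30.990, 31.024, 30.929,
    31.050, 30.991, 31.208, 30.830, 31.330, 30.810, 31.060, 30.800,
    31.091, 31.170, 31.026, 31.020, 30.880, 31.125,
]


def known_sigma_interval(data, sigma, alpha):
    """z-interval for the mean when the standard deviation sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    mean = statistics.mean(data)
    half = z * sigma / math.sqrt(len(data))
    return mean - half, mean + half


def unknown_sigma_interval(data, t_critical):
    """t-interval for the mean; t_critical = t_{n-1,alpha/2} is read from a table."""
    mean = statistics.mean(data)
    half = t_critical * statistics.stdev(data) / math.sqrt(len(data))
    return mean - half, mean + half


z_lo, z_hi = known_sigma_interval(osterfeld, sigma=0.1, alpha=0.05)
t_lo, t_hi = unknown_sigma_interval(daw_mill, t_critical=2.080)  # t_{21,0.025}
print(f"Osterfeld: ({z_lo:.3f}, {z_hi:.3f}), width {z_hi - z_lo:.3f}")
print(f"Daw Mill:  ({t_lo:.3f}, {t_hi:.3f}), width {t_hi - t_lo:.3f}")
print(f"width ratio: {(t_hi - t_lo) / (z_hi - z_lo):.2f}")
```

The last line puts a number on how much wider the interval becomes once $\sigma$ has to be estimated and the t critical value replaces the normal one.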


23.3  Bootstrap  confidence  intervals 

It  is  not  uncommon  that  the  methods  of  the  previous  section  are  used  even 
when  the  normal  distribution  is  not  a  good  model  for  the  data.  In  some  cases 
this  is  not  a  big  problem:  with  small  deviations  from  normality  the  actual 
confidence  level  of  a  constructed  confidence  interval  may  deviate  only  a  few 
percent  from  the  intended  confidence  level.  For  large  datasets  the  central  limit 
theorem  in  fact  ensures  that  this  method  provides  confidence  intervals  with 
approximately  correct  confidence  levels,  as  we  shall  see  in  the  next  section. 

If we doubt the normality of the data and we do not have a large sample, usually
the best thing to do is to bootstrap. Suppose we have a dataset $x_1,\ldots,x_n$,
modeled as a realization of a random sample from some distribution $F$, and
we want to construct a confidence interval for its (unknown) expectation $\mu$.
In the previous section we saw that it suffices to find numbers $c_l$ and $c_u$
such that

\[ P\left(c_l < \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < c_u\right) = 1 - \alpha. \]

The $100(1-\alpha)\%$ confidence interval would then be

\[ \left(\bar{x}_n - c_u\frac{s_n}{\sqrt{n}},\ \bar{x}_n - c_l\frac{s_n}{\sqrt{n}}\right), \]

where, of course, $\bar{x}_n$ and $s_n$ are the sample mean and the sample standard
deviation. To find $c_l$ and $c_u$ we need to know the distribution of the
studentized mean

\[ T = \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}. \]

We apply the bootstrap principle. From the data $x_1,\ldots,x_n$ we determine an
estimate $\hat{F}$ of $F$. Let $X_1^*,\ldots,X_n^*$ be a random sample from
$\hat{F}$, with $\mu^* = E[X_i^*]$, and consider

\[ T^* = \frac{\bar{X}_n^* - \mu^*}{S_n^*/\sqrt{n}}. \]

The distribution of $T^*$ is now used as an approximation to the distribution
of $T$. If we use $\hat{F} = F_n$, we get the following.


Empirical bootstrap simulation for the studentized mean.
Given a dataset $x_1, x_2, \ldots, x_n$, determine its empirical distribution
function $F_n$ as an estimate of $F$. The expectation corresponding
to $F_n$ is $\mu^* = \bar{x}_n$.

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$.

2. Compute the studentized mean for the bootstrap dataset:

\[ t^* = \frac{\bar{x}_n^* - \bar{x}_n}{s_n^*/\sqrt{n}}, \]

where $\bar{x}_n^*$ and $s_n^*$ are the sample mean and sample standard
deviation of $x_1^*, x_2^*, \ldots, x_n^*$.

Repeat steps 1 and 2 many times.


From the bootstrap experiment we can determine $c_l^*$ and $c_u^*$ such that

\[ P\left(c_l^* < \frac{\bar{X}_n^* - \mu^*}{S_n^*/\sqrt{n}} < c_u^*\right) \approx 1 - \alpha. \]

By the bootstrap principle we may transfer this statement about the distribution
of $T^*$ to the distribution of $T$. That is, we may use these estimated
critical values as bootstrap approximations to $c_l$ and $c_u$:

\[ c_l \approx c_l^* \quad\text{and}\quad c_u \approx c_u^*. \]

Therefore, we call

\[ \left(\bar{x}_n - c_u^*\frac{s_n}{\sqrt{n}},\ \bar{x}_n - c_l^*\frac{s_n}{\sqrt{n}}\right) \]

a $100(1-\alpha)\%$ bootstrap confidence interval for $\mu$.
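The procedure translates almost line by line into code. The following Python sketch is one possible implementation, not the authors' own: the function name, the seed, and the use of order statistics as estimated critical values are assumptions, and the Daw Mill measurements of Table 23.2 serve only as a convenient dataset.

```python
import math
import random
import statistics


def bootstrap_t_interval(data, alpha, repetitions, rng):
    """Empirical bootstrap interval for the mean, based on the studentized mean."""
    n = len(data)
    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)

    t_stars = []
    while len(t_stars) < repetitions:
        resample = rng.choices(data, k=n)        # step 1: draw from F_n
        s_star = statistics.stdev(resample)
        if s_star == 0.0:
            continue                             # degenerate resample, t* undefined
        # step 2: studentized mean of the bootstrap dataset
        t_stars.append((statistics.mean(resample) - mean) / (s_star / math.sqrt(n)))

    t_stars.sort()
    k = round(repetitions * alpha / 2)
    if k < 1:
        raise ValueError("too few repetitions for this alpha")
    c_l = t_stars[k - 1]                # k-th order statistic
    c_u = t_stars[repetitions - k]      # (repetitions - k + 1)-th order statistic
    return mean - c_u * se, mean - c_l * se


daw_mill = [
    30.990, 31.030, 31.060, 30.921, 30.920, 30.990, 31.024, 30.929,
    31.050, 30.991, 31.208, 30.830, 31.330, 30.810, 31.060, 30.800,
    31.091, 31.170, 31.026, 31.020, 30.880, 31.125,
]
low, high = bootstrap_t_interval(daw_mill, alpha=0.05, repetitions=1000,
                                 rng=random.Random(2024))
print(f"95% bootstrap interval: ({low:.3f}, {high:.3f})")
```

Note that the upper critical value determines the lower endpoint and vice versa, exactly as in the interval formula; resamples with zero standard deviation are skipped, which is a pragmatic choice of this sketch.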

Example:  the  software  data 

Recall the software data, a dataset of interfailure times (see Section 17.3).
From the nature of the data (failure times are positive numbers) and the
histogram (Figure 17.5), we know that they should not be modeled as a
realization of a random sample from a normal distribution. From the data we know
$\bar{x}_n = 656.88$, $s_n = 1037.3$, and $n = 135$. We generate one thousand
bootstrap datasets, and for each dataset we compute $t^*$ as in step 2 of the
procedure. The histogram and empirical distribution function made from these one
thousand values are estimates of the density and the distribution function,
respectively, of the bootstrap sample statistic $T^*$; see Figure 23.4.

Fig. 23.4. Histogram and empirical distribution function of the studentized
bootstrap simulation results for the software data.


We want to make a 90% bootstrap confidence interval, so we need $c_l^*$ and
$c_u^*$, or the 0.05th and 0.95th quantile from the empirical distribution
function in Figure 23.4. The 50th order statistic of the one thousand $t^*$
values is $-2.107$. This means that 50 out of the one thousand values, or 5%, are
smaller than or equal to this value, and so $c_l^* = -2.107$. Similarly, from the
951st order statistic, 1.389, we obtain³ $c_u^* = 1.389$. Inserting these values,
we find the following 90% bootstrap confidence interval for $\mu$:

\[ \left(656.88 - 1.389\frac{1037.3}{\sqrt{135}},\ 656.88 - (-2.107)\frac{1037.3}{\sqrt{135}}\right) = (532.9,\ 845.0). \]

³ These results deviate slightly from the definition of empirical quantiles as
given in Section 16.3. That method is a little more accurate.

Quick exercise 23.6 The 25th and 976th order statistic from the preceding
bootstrap results are $-2.443$ and 1.713, respectively. Use these numbers to
construct a confidence interval for $\mu$. What is the corresponding confidence
level?

Why  the  bootstrap  may  be  better 

The reason to use the bootstrap is that it should lead to a more accurate
approximation of the distribution of the studentized mean than the $t(n-1)$
distribution that follows from assuming normality. If, in the previous example,
we thought we had normal data, we would use critical values from the
$t(134)$ distribution: $t_{134,0.05} = 1.656$. The result would be

\[ \left(656.88 - 1.656\frac{1037.3}{\sqrt{135}},\ 656.88 + 1.656\frac{1037.3}{\sqrt{135}}\right) = (509.0,\ 804.7). \]

Comparing the intervals, we see that here the bootstrap interval is a little
wider and, as opposed to the t-interval, not centered around the sample mean
but skewed to the right side. This is one of the features of the bootstrap:
if the distribution from which the data originate is skewed, this is reflected
in the confidence interval. Looking at the histogram of the software data
(Figure 17.5), we see that it is skewed to the right: it has a long tail on the
right, but not on the left, so the same most likely holds for the distribution
from which these data originate. The skewness is reflected in the confidence
interval, which extends more to the right of $\bar{x}_n$ than to the left. In some
sense, the bootstrap adapts to the shape of the distribution, and in this way it
leads to more accurate confidence statements than using the method for normal
data. What we mean by this is that, for example, with the normal method
only 90% of the 95% confidence statements would actually cover the true
value, whereas for the bootstrap intervals this percentage would be close(r)
to 95%.
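The asymmetry is easy to quantify from the numbers already given. The short Python sketch below recomputes both intervals for the software data and reports how far each one reaches to the left and to the right of the sample mean; the variable names are assumptions made for this example.

```python
import math

# Summary statistics and critical values quoted in the text.
mean, sd, n = 656.88, 1037.3, 135
c_l_star, c_u_star = -2.107, 1.389   # bootstrap critical values
t_crit = 1.656                       # t_{134,0.05}

se = sd / math.sqrt(n)
bootstrap = (mean - c_u_star * se, mean - c_l_star * se)
normal_theory = (mean - t_crit * se, mean + t_crit * se)

for name, (lo, hi) in (("bootstrap", bootstrap), ("t(134)", normal_theory)):
    print(f"{name:10s} ({lo:.1f}, {hi:.1f})  "
          f"left reach {mean - lo:.1f}, right reach {hi - mean:.1f}")
```

The t-interval reaches equally far on both sides by construction, whereas the bootstrap interval reaches noticeably further to the right.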


23.4  Large  samples 

A variant of the central limit theorem states that as $n$ goes to infinity, the
distribution of the studentized mean

\[ \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} \]

approaches the standard normal distribution. This fact is the basis for so-called
large sample confidence intervals. Suppose $X_1,\ldots,X_n$ is a random
sample from some distribution $F$ with expectation $\mu$. If $n$ is large enough,
we may use

\[ P\left(-z_{\alpha/2} < \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha. \tag{23.5} \]

This implies that if $x_1,\ldots,x_n$ can be seen as a realization of a random
sample from some unknown distribution with expectation $\mu$ and if $n$ is large
enough, then

\[ \left(\bar{x}_n - z_{\alpha/2}\frac{s_n}{\sqrt{n}},\ \bar{x}_n + z_{\alpha/2}\frac{s_n}{\sqrt{n}}\right) \]

is an approximate $100(1-\alpha)\%$ confidence interval for $\mu$.

Just  as  earlier  with  the  central  limit  theorem,  a  key  question  is  “how  big 
should  n  be?”  Again,  there  is  no  easy  answer.  To  give  you  some  idea,  we  have 
listed  in  Table  23.3  the  results  of  a  small  simulation  experiment.  For  each  of 
the  distributions,  sample  sizes,  and  confidence  levels  listed,  we  constructed 
10  000  confidence  intervals  with  the  large  sample  method;  the  numbers  listed 
in  the  table  are  the  confidence  levels  as  estimated  from  the  simulation,  the 
coverage  probabilities.  The  chosen  Pareto  distribution  is  very  skewed,  and  this 
shows;  the  coverage  probabilities  for  the  exponential  are  just  a  few  percent 
off. 


Table 23.3. Estimated coverage probabilities for large sample confidence intervals
for non-normal data.

                        $\gamma$
Distribution     n      0.900    0.950
Exp(1)          20      0.851    0.899
Exp(1)         100      0.890    0.938
Par(2.1)        20      0.727    0.774
Par(2.1)       100      0.798    0.849
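A simulation of this kind takes only a few lines. The Python sketch below estimates coverage probabilities of the large sample interval; it assumes that Par(2.1) denotes the Pareto distribution on $[1, \infty)$ with shape parameter 2.1, and the function names, seeds, and number of repetitions are choices made for this example, so the estimates will differ slightly from Table 23.3.

```python
import math
import random
from statistics import NormalDist


def mean_and_sd(xs):
    """Sample mean and sample standard deviation (divisor n - 1)."""
    n = len(xs)
    m = math.fsum(xs) / n
    return m, math.sqrt(math.fsum((x - m) ** 2 for x in xs) / (n - 1))


def coverage(sample, true_mean, n, gamma, repetitions, seed):
    """Estimate how often the large sample interval covers true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - gamma) / 2)   # z_{alpha/2}
    hits = 0
    for _ in range(repetitions):
        m, s = mean_and_sd([sample(rng) for _ in range(n)])
        hits += abs(m - true_mean) < z * s / math.sqrt(n)
    return hits / repetitions


def exp1(rng):
    return rng.expovariate(1.0)       # Exp(1), expectation 1


def par21(rng):
    return rng.paretovariate(2.1)     # Pareto on [1, inf), expectation 2.1/1.1


for n in (20, 100):
    print(n, coverage(exp1, 1.0, n, 0.90, 2000, seed=n),
          coverage(par21, 2.1 / 1.1, n, 0.90, 2000, seed=n))
```

Each printed row gives the sample size followed by the estimated coverage of the nominal 90% interval for the exponential and the Pareto case.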

In  the  case  of  simulation  one  can  often  quite  easily  generate  a  very  large 
number  of  independent  repetitions,  and  then  this  question  poses  no  problem. 
In  other  cases  there  may  be  nothing  better  to  do  than  hope  that  the  dataset 
is  large  enough.  We  give  an  example  where  (we  believe!)  this  is  definitely  the 
case. 

In  an  article  published  in  1910  ([28]),  Rutherford  and  Geiger  reported  their 
observations  on  the  radioactive  decay  of  the  element  polonium.  Using  a  small 
disk  coated  with  polonium  they  counted  the  number  of  emitted  alpha-particles 
during  2608  intervals  of  7.5  seconds  each.  The  dataset  consists  of  the  counted 
number  of  alpha-particles  for  each  of  the  2608  intervals  and  can  be  summarized 
as  in  Table  23.4. 


Table 23.4. Alpha-particle counts for 2608 intervals of 7.5 seconds.

Count         0     1     2     3     4
Frequency    57   203   383   525   532

Count         5     6     7     8     9
Frequency   408   273   139    45    27

Count        10    11    12    13    14
Frequency    10     4     0     1     1

Source: E. Rutherford and H. Geiger (with a note by H. Bateman), The probability
variations in the distribution of $\alpha$ particles, Phil. Mag., 6: 698-704, 1910;
the table on page 701.


The total number of counted alpha-particles is 10 097, the average number
per interval is therefore 3.8715. The sample standard deviation can also
be computed from the table; it is 1.9225. So we know of the actual data
$x_1, x_2, \ldots, x_{2608}$ (where the counts $x_i$ are between 0 and 14) that
$\bar{x}_n = 3.8715$ and $s_n = 1.9225$. We construct a 98% confidence interval
for the expected number of particles per interval. As $z_{0.01} = 2.33$ this
results in

\[ \left(3.8715 - 2.33\frac{1.9225}{\sqrt{2608}},\ 3.8715 + 2.33\frac{1.9225}{\sqrt{2608}}\right) = (3.784,\ 3.959). \]
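The summary statistics and the interval can be recomputed from the frequency table alone. A minimal Python sketch, with variable names chosen for this example and the critical value obtained from `statistics.NormalDist` rather than from Table B.1:

```python
import math
from statistics import NormalDist

# Frequencies from Table 23.4: index = count, value = number of intervals.
frequencies = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]

n = sum(frequencies)
total = sum(count * f for count, f in enumerate(frequencies))
mean = total / n
ss = sum(f * (count - mean) ** 2 for count, f in enumerate(frequencies))
sd = math.sqrt(ss / (n - 1))

z = NormalDist().inv_cdf(0.99)  # z_{0.01}, about 2.33
half = z * sd / math.sqrt(n)
print(n, total, round(mean, 4), round(sd, 4))
print(f"98% interval: ({mean - half:.3f}, {mean + half:.3f})")
```

Working from the grouped frequencies avoids expanding the table into 2608 individual observations.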


23.5  Solutions  to  the  quick  exercises 

23.1 From the probability statement, we derive, using $\sigma_T = 100$ and
$8/9 = 0.889$:

\[ \theta \in (T - 300,\ T + 300) \quad \text{with probability at least 88\%.} \]

With $t = 299\,852.4$, this becomes

\[ \theta \in (299\,552.4,\ 300\,152.4) \quad \text{with confidence at least 88\%.} \]

23.2 Chebyshev’s inequality only gives an upper bound on the probability
$P(|T - \theta| \ge 2\sigma_T)$, namely 1/4. The actual value of
$P(|T - \theta| < 2\sigma_T)$ could therefore be higher than 3/4, depending on the
distribution of $T$. For example, in Quick exercise 13.2 we saw that in case of an
exponential distribution this probability is 0.950. For other distributions, even
higher values are attained; see Exercise 13.1.

23.3 For each of the confidence intervals we have a 5% probability that
it is wrong. Therefore, the number of wrong confidence intervals has a
$Bin(40, 0.05)$ distribution, and we would expect about $40 \cdot 0.05 = 2$ to be
wrong. The standard deviation of this distribution is
$\sqrt{40 \cdot 0.05 \cdot 0.95} = 1.38$. The outcome “10 confidence intervals
wrong” is $(10 - 2)/1.38 = 5.8$ standard deviations from the expectation and would
be a surprising outcome indeed. (The probability of 10 or more wrong is 0.00002.)

23.4 We need to solve $P(Z \ge a) = 0.01$. In Table B.1 we find
$P(Z \ge 2.33) = 0.0099 \approx 0.01$, so $z_{0.01} \approx 2.33$. For $z_{0.95}$ we
need to solve $P(Z \ge a) = 0.95$, and because this is in the left tail of the
distribution, we use $z_{0.95} = -z_{0.05}$. In the table we read
$P(Z \ge 1.64) = 0.0505$ and $P(Z \ge 1.65) = 0.0495$, from which we conclude
$z_{0.05} \approx (1.64 + 1.65)/2 = 1.645$ and $z_{0.95} \approx -1.645$.

23.5 In Table B.2 we find $P(T_3 \ge 4.541) = 0.01$, so $t_{3,0.01} = 4.541$. For
$t_{35,0.9975}$, we need to use $t_{35,0.9975} = -t_{35,0.0025}$. In the table we
find $t_{30,0.0025} = 3.030$ and $t_{40,0.0025} = 2.971$, and by interpolation
$t_{35,0.0025} \approx (3.030 + 2.971)/2 = 3.0005$. Hence,
$t_{35,0.9975} \approx -3.000$.

23.6 The order statistics are estimates for $c^*_{0.025}$ and $c^*_{0.975}$,
respectively. So the corresponding $\alpha$ is 0.05, and the 95% bootstrap
confidence interval for $\mu$ is:

\[ \left(656.88 - 1.713\frac{1037.3}{\sqrt{135}},\ 656.88 - (-2.443)\frac{1037.3}{\sqrt{135}}\right) = (504.0,\ 875.0). \]


23.6  Exercises 

23.1 □ A bottling machine is known to fill wine bottles with amounts that
follow an $N(\mu,\sigma^2)$ distribution, with $\sigma = 5$ (ml). In a sample of
16 bottles, $\bar{x} = 743$ (ml) was found. Construct a 95% confidence interval
for $\mu$.

23.2 □ You are given a dataset that may be considered a realization of a
normal random sample. The size of the dataset is 34, the average is 3.54, and
the sample standard deviation is 0.13. Construct a 98% confidence interval
for the unknown expectation $\mu$.

23.3  You  have  ordered  10  bags  of  cement,  which  are  supposed  to  weigh  94  kg 
each.  The  average  weight  of  the  10  bags  is  93.5  kg.  Assuming  that  the  10 
weights  can  be  viewed  as  a  realization  of  a  random  sample  from  a  normal 
distribution  with  unknown  parameters,  construct  a  95%  confidence  interval 
for  the  expected  weight  of  a  bag.  The  sample  standard  deviation  of  the  10 
weights  is  0.75. 

23.4 A new type of car tire is launched by a tire manufacturer. The automobile
association performs a durability test on a random sample of 18 of
these tires. For each tire the durability is expressed as a percentage: a score
of 100 (%) means that the tire lasted exactly as long as the average standard
tire, an accepted comparison standard. From the multitude of factors that
influence the durability of individual tires the assumption is warranted that the
durability of an arbitrary tire follows an $N(\mu,\sigma^2)$ distribution. The
parameters $\mu$ and $\sigma^2$ characterize the tire type, and $\mu$ could be
called the durability index for this type of tire. The automobile association
found for the tested tires: $\bar{x}_{18} = 195.3$ and $s_{18} = 16.7$. Construct a
95% confidence interval for $\mu$.




23.5 ⊞ During the 2002 Winter Olympic Games in Salt Lake City a newspaper
article mentioned the alleged advantage speed-skaters have in the 1500 m race
if they start in the outer lane. In the men’s 1500 m, there were 24 races, but
in race 13 (really!) someone fell and did not finish. The results in seconds of
the remaining 23 races are listed in Table 23.5. You should know that who
races against whom, in which race, and who starts in the outer lane are all
determined by a fair lottery.


Table 23.5. Speed-skating results in seconds, men’s 1500 m (except race 13), 2002
Winter Olympic Games.

Race number   Inner lane   Outer lane   Difference
     1          107.04       105.98        1.06
     2          109.24       108.20        1.04
     3          111.02       108.40        2.62
     4          108.02       108.58       -0.56
     5          107.83       105.51        2.32
     6          109.50       112.01       -2.51
     7          111.81       112.87       -1.06
     8          111.02       106.40        4.62
     9          106.04       104.57        1.47
    10          110.15       110.70       -0.55
    11          109.42       109.45       -0.03
    12          108.13       109.57       -1.44
    14          105.86       105.97       -0.11
    15          108.27       105.63        2.64
    16          107.63       105.41        2.22
    17          107.72       110.26       -2.54
    18          106.38       105.82        0.56
    19          107.78       106.29        1.49
    20          108.57       107.26        1.31
    21          106.99       103.95        3.04
    22          107.21       106.00        1.21
    23          105.34       105.26        0.08
    24          108.76       106.75        2.01

Mean            108.25       107.43        0.82
St. dev.          1.70         2.42        1.78

a. As a consequence of the lottery and the fact that many different factors
contribute to the actual time difference “inner lane minus outer lane” the
assumption of a normal distribution for the difference is warranted. The
numbers in the last column can be seen as realizations from an
$N(\delta,\sigma^2)$ distribution, where $\delta$ is the expected outer lane
advantage. Construct a 95% confidence interval for $\delta$. N.B. $n = 23$, not 24!

b. You decide to make a bootstrap confidence interval instead. Describe the
appropriate bootstrap experiment.

c. The bootstrap experiment was performed with one thousand repetitions.
Part of the bootstrap outcomes are listed in the following table. From the
ordered list of results, numbers 21 to 60 and 941 to 980 are given. Use
these to construct a 95% bootstrap confidence interval for $\delta$.


  21-25    -2.202   -2.164   -2.111   -2.109   -2.101
  26-30    -2.099   -2.006   -1.985   -1.967   -1.929
  31-35    -1.917   -1.898   -1.864   -1.830   -1.808
  36-40    -1.800   -1.799   -1.774   -1.773   -1.756
  41-45    -1.736   -1.732   -1.731   -1.717   -1.716
  46-50    -1.699   -1.692   -1.691   -1.683   -1.666
  51-55    -1.661   -1.644   -1.638   -1.637   -1.620
  56-60    -1.611   -1.611   -1.601   -1.600   -1.593

941-945     1.648    1.667    1.669    1.689    1.696
946-950     1.708    1.722    1.726    1.735    1.814
951-955     1.816    1.825    1.856    1.862    1.864
956-960     1.875    1.877    1.897    1.905    1.917
961-965     1.923    1.948    1.961    1.987    2.001
966-970     2.015    2.015    2.017    2.018    2.034
971-975     2.035    2.037    2.039    2.053    2.060
976-980     2.088    2.092    2.101    2.129    2.143

23.6 ⊞ A dataset $x_1, x_2, \ldots, x_n$ is given, modeled as realization of a
sample $X_1, X_2, \ldots, X_n$ from an $N(\mu, 1)$ distribution. Suppose there are
sample statistics $L_n = g(X_1, \ldots, X_n)$ and $U_n = h(X_1, \ldots, X_n)$ such that

$$P(L_n < \mu < U_n) = 0.95$$

for every value of $\mu$. Suppose that the corresponding 95% confidence interval
derived from the data is $(l_n, u_n) = (-2, 5)$.

a. Suppose $\theta = 3\mu + 7$. Let $\tilde{L}_n = 3L_n + 7$ and $\tilde{U}_n = 3U_n + 7$. Show that
$P(\tilde{L}_n < \theta < \tilde{U}_n) = 0.95$.

b. Write the 95% confidence interval for $\theta$ in terms of $l_n$ and $u_n$.

c. Suppose $\theta = 1 - \mu$. Again, find $\tilde{L}_n$ and $\tilde{U}_n$, as well as the confidence
interval for $\theta$.

d. Suppose $\theta = \mu^2$. Can you construct a confidence interval for $\theta$?

23.7 □ A 95% confidence interval for the parameter $\mu$ of a $Pois(\mu)$
distribution is given: (2, 3). Let X be a random variable with this distribution.
Construct a 95% confidence interval for $P(X = 0) = e^{-\mu}$.

23.8 Suppose that in Exercise 23.1 the content of the bottles has to be
determined by weighing. It is known that the wine bottles involved weigh on
average 250 grams, with a standard deviation of 15 grams, and the weights
follow a normal distribution. For a sample of 16 bottles, an average weight of
998 grams was found. You may assume that 1 ml of wine weighs 1 gram, and
that the filling amount is independent of the bottle weight. Construct a 95%
confidence interval for the expected amount of wine per bottle, $\mu$.

23.9 Consider the alpha-particle counts discussed in Section 23.4; the data
are given in Table 23.4. We want to bootstrap in order to make a bootstrap
confidence interval for the expected number of particles in a 7.5-second
interval.


a.  Describe  in  detail  how  you  would  perform  the  bootstrap  simulation. 

b. The bootstrap experiment was performed with one thousand repetitions.
Part of the (ordered) bootstrap $t^*$'s are given in the following table.
Construct the 95% bootstrap confidence interval for the expected number of
particles in a 7.5-second interval.

Ranks       Ordered bootstrap $t^*$ values
  1-5      -2.996   -2.942   -2.831   -2.663   -2.570
  6-10     -2.537   -2.505   -2.290   -2.273   -2.228
 11-15     -2.193   -2.112   -2.092   -2.086   -2.045
 16-20     -1.983   -1.980   -1.978   -1.950   -1.931
 21-25     -1.920   -1.910   -1.893   -1.889   -1.888
 26-30     -1.865   -1.864   -1.832   -1.817   -1.815
 31-35     -1.755   -1.751   -1.749   -1.746   -1.744
 36-40     -1.734   -1.723   -1.710   -1.708   -1.705
 41-45     -1.703   -1.700   -1.696   -1.692   -1.691
 46-50     -1.691   -1.675   -1.660   -1.656   -1.650
951-955     1.635    1.638    1.643    1.648    1.661
956-960     1.666    1.668    1.678    1.681    1.686
961-965     1.692    1.719    1.721    1.753    1.772
966-970     1.773    1.777    1.806    1.814    1.821
971-975     1.824    1.826    1.837    1.838    1.845
976-980     1.862    1.877    1.881    1.883    1.956
981-985     1.971    1.992    2.060    2.063    2.083
986-990     2.089    2.177    2.181    2.186    2.224
991-995     2.234    2.264    2.273    2.310    2.348
996-1000    2.483    2.556    2.870    2.890    3.546

c. Answer this without doing any calculations: if we made the 98% bootstrap
confidence interval, would it be smaller or larger than the interval
constructed in Section 23.4?

23.10 In a report you encounter a 95% confidence interval (1.6, 7.8) for the
parameter $\mu$ of an $N(\mu, \sigma^2)$ distribution. The interval is based on 16
observations, constructed according to the studentized mean procedure.

a. What is the mean of the (unknown) dataset?

b. You prefer to have a 99% confidence interval for $\mu$. Construct it.

23.11 ⊞ A 95% confidence interval for the unknown expectation of some
distribution contains the number 0.

a.  We  construct  the  corresponding  98%  confidence  interval,  using  the  same 
data.  Will  it  contain  the  number  0? 

b. The confidence interval in fact is a bootstrap confidence interval. We
repeat the bootstrap experiment (using the same data) and construct a new
95% confidence interval based on the results. Will it contain the number 0?

c.  We  collect  new  data,  resulting  in  a  dataset  of  the  same  size.  With  this  data, 
we  construct  a  95%  confidence  interval  for  the  unknown  expectation.  Will 
the  interval  contain  0? 

23.12 Let $Z_1, \ldots, Z_n$ be a random sample from an $N(0, 1)$ distribution. Define
$X_i = \mu + \sigma Z_i$ for $i = 1, \ldots, n$ and $\sigma > 0$. Let $\bar{Z}$, $\bar{X}$ denote the sample averages
and $S_Z$ and $S_X$ the sample standard deviations, of the $Z_i$ and $X_i$, respectively.

a. Show that $X_1, \ldots, X_n$ is a random sample from an $N(\mu, \sigma^2)$ distribution.

b. Express $\bar{X}$ and $S_X$ in terms of $\bar{Z}$, $S_Z$, $\mu$, and $\sigma$.

c. Verify that

$$\frac{\bar{X} - \mu}{S_X/\sqrt{n}} = \frac{\bar{Z}}{S_Z/\sqrt{n}},$$

and explain why this shows that the distribution of the studentized mean
does not depend on $\mu$ and $\sigma$.


24 


More  on  confidence  intervals 


While in Chapter 23 we were solely concerned with confidence intervals for
expectations, in this chapter we treat a variety of topics. First, we focus on
confidence intervals for the parameter p of the binomial distribution. Then,
based on an example, we briefly discuss a general method to construct
confidence intervals. One-sided confidence intervals, or upper and lower confidence
bounds, are discussed next. At the end of the chapter we investigate the
question of how to determine the sample size when a confidence interval of a certain
width is desired.


24.1  The  probability  of  success 

A common situation is that we observe a random variable X with a Bin(n,p)
distribution  and  use  X  to  estimate  p.  For  example,  if  we  want  to  estimate 
the  proportion  of  voters  that  support  candidate  G  in  an  election,  we  take  a 
sample  from  the  voter  population  and  determine  the  proportion  in  the  sample 
that  supports  G.  If  n  individuals  are  selected  at  random  from  the  population, 
where  a  proportion  p  supports  candidate  G,  the  number  of  supporters  X  in 
the  sample  is  modeled  by  a  Bin(n,p)  distribution;  we  count  the  supporters  of 
candidate  G  as  “successes.”  Usually,  the  sample  proportion  X/n  is  taken  as 
an  estimator  for  p. 

If we want to make a confidence interval for p, based on the number of
successes X in the sample, we need to find statistics L and U (see the definition
of confidence intervals on page 343) such that

$$P(L < p < U) = 1 - \alpha,$$

where L and U are to be based on X only. In general, this problem does
not have a solution. However, the method for large n described next,
sometimes called “the Wilson method” (see [40]), yields confidence intervals with
confidence level approximately $100(1 - \alpha)\%$. (How close the true confidence
level is to $100(1 - \alpha)\%$ depends on the (unknown) p, though it is known that
for p near 0 and 1 it is too low. For some details and an alternative for this
situation, see Remark 24.1.)

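Recall the normal approximation to the binomial distribution, a consequence
of the central limit theorem (see page 201 and Exercise 14.5): for large n, the
distribution of X is approximately normal and

$$\frac{X - np}{\sqrt{np(1-p)}}$$

is approximately standard normal. By dividing by n in both the numerator
and the denominator, we see that this equals:

$$\frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}}.$$

Therefore, for large n

$$P\left(-z_{\alpha/2} < \frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}} < z_{\alpha/2}\right) \approx 1 - \alpha.$$

Note that the event

$$-z_{\alpha/2} < \frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}} < z_{\alpha/2}$$

is the same as

$$\left(\frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}}\right)^2 < (z_{\alpha/2})^2
\quad\text{or}\quad
\left(\frac{X}{n} - p\right)^2 - (z_{\alpha/2})^2\,\frac{p(1-p)}{n} < 0.$$

To derive expressions for L and U we can rewrite the inequality in this
statement to obtain the form L < p < U, but the resulting formulas are rather
awkward. To obtain the confidence interval, we instead substitute the data
values directly and then solve for p, which yields the desired result.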

Suppose, in a sample of 125 voters, 78 support one candidate. What is the 95%
confidence interval for the population proportion p supporting that candidate?
The realization of X is x = 78 and n = 125. We substitute this, together with
$z_{\alpha/2} = z_{0.025} = 1.96$, in the last inequality:

$$\left(\frac{78}{125} - p\right)^2 - \frac{(1.96)^2}{125}\, p(1-p) < 0,$$

or, working out squares and products and grouping terms:

$$1.0307p^2 - 1.2787p + 0.3894 < 0.$$

Fig. 24.1. The parabola $1.0307p^2 - 1.2787p + 0.3894$ and the resulting confidence
interval.

This quadratic form describes a parabola, which is depicted in Figure 24.1.
Also, for other values of n and x there always results a quadratic inequality like
this, with a positive coefficient for $p^2$ and a similar picture. For the confidence
interval we need to find the values where the parabola intersects the horizontal
axis. The solutions we find are:

$$p_{1,2} = \frac{-(-1.2787) \pm \sqrt{(-1.2787)^2 - 4 \cdot 1.0307 \cdot 0.3894}}{2 \cdot 1.0307} = 0.6203 \pm 0.0835;$$

hence, l = 0.54 and u = 0.70, so the resulting confidence interval is (0.54, 0.70).
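
The computation in this example can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function name `wilson_interval` is our own, and the code simply collects the coefficients of the quadratic inequality $(x/n - p)^2 - (z^2/n)\,p(1-p) < 0$ and returns its two roots.

```python
import math

def wilson_interval(x, n, z):
    """Return the roots (l, u) of (x/n - p)^2 - (z^2/n) p (1 - p) = 0.

    x: observed number of successes, n: sample size,
    z: the critical value z_{alpha/2} (e.g. 1.96 for 95%).
    """
    p_hat = x / n
    # Collect terms into a*p^2 + b*p + c < 0.
    a = 1 + z**2 / n
    b = -(2 * p_hat + z**2 / n)
    c = p_hat**2
    root = math.sqrt(b**2 - 4 * a * c)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)

# The voter example: 78 supporters in a sample of 125, 95% confidence.
lower, upper = wilson_interval(78, 125, 1.96)
print(f"({lower:.2f}, {upper:.2f})")  # (0.54, 0.70)
```

Because the coefficient of $p^2$ is always positive, the inequality holds exactly between the two roots, which is why the interval is read off as (smaller root, larger root).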

Quick exercise 24.1 Suppose in another election we find 80 supporters in a
sample of 200. Suppose we use $\alpha = 0.0456$ for which $z_{\alpha/2} = 2$. Construct the
corresponding confidence interval for p.


Remark  24.1  (Coverage  probabilities  and  an  alternative  method). 

Because of the discrete nature of the binomial distribution, the probability
that the confidence interval covers the true parameter value depends
on p. As a function of p it typically oscillates in a sawtooth-like manner
around $1 - \alpha$, being too high for some values and too low for others. This
is something that cannot be escaped from; the phenomenon is present in
every method. In an average sense, the method treated in the text yields
coverage probabilities close to $1 - \alpha$, though for arbitrarily high values of n
it is possible to find p’s for which the actual coverage is several percentage
points too low. The low coverage occurs for p’s near 0 and 1.

An alternative is the method proposed by Agresti and Coull, which overall
is more conservative than the Wilson method (in fact, the Agresti-Coull
interval contains the Wilson interval as a proper subset). Especially for p
near 0 or 1 this method yields conservative confidence intervals. Define

$$\tilde{X} = X + \frac{(z_{\alpha/2})^2}{2} \quad\text{and}\quad \tilde{n} = n + (z_{\alpha/2})^2,$$

and $\tilde{p} = \tilde{X}/\tilde{n}$. The approximate $100(1 - \alpha)\%$ confidence interval is then
given by

$$\left(\tilde{p} - z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}},\; \tilde{p} + z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}\right).$$

For a clear survey paper on confidence intervals for p we recommend Brown
et al. [4].
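
The dependence of the coverage probability on p can be made concrete with a small sketch. The code below is an illustration under our own naming (`wilson_interval`, `agresti_coull_interval`, and `coverage` are hypothetical helper names, not from the text). It writes the Wilson interval in closed form (the same two roots as the quadratic in Section 24.1), implements the Agresti-Coull interval with the adjusted counts defined in this remark, and computes the exact coverage at a given p by summing binomial probabilities over all outcomes x whose interval contains p.

```python
import math

def wilson_interval(x, n, z):
    """Wilson interval: closed form of the roots of the quadratic in Section 24.1."""
    center = (x + z**2 / 2) / (n + z**2)
    half = z / (n + z**2) * math.sqrt(x * (n - x) / n + z**2 / 4)
    return center - half, center + half

def agresti_coull_interval(x, n, z):
    """Agresti-Coull interval, using the adjusted counts from Remark 24.1."""
    x_adj = x + z**2 / 2
    n_adj = n + z**2
    p_adj = x_adj / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

def coverage(interval, n, p, z):
    """Exact probability that the interval covers p when X is Bin(n, p)."""
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n, z)
        if lower < p < upper:
            total += math.comb(n, x) * p**x * (1 - p) ** (n - x)
    return total

# Coverage of the nominal 95% intervals for a few (arbitrarily chosen) values of p.
n, z = 125, 1.96
for p in (0.02, 0.1, 0.5):
    print(p, coverage(wilson_interval, n, p, z),
          coverage(agresti_coull_interval, n, p, z))
```

Evaluating `coverage` on a fine grid of p values is one way to see the sawtooth pattern described above; the grid and the sample size are free choices.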


24.2  Is  there  a  general  method? 

We  have  now  seen  a  number  of  examples  of  confidence  intervals,  and  while  it 
should  be  clear  to  you  that  in  each  of  these  cases  the  resulting  intervals  are 
valid  confidence  intervals,  you  may  wonder  how  we  go  about  finding  confidence 
intervals  in  new  situations.  One  could  ask:  is  there  a  general  method?  We  first 
consider  an  example. 

A  confidence  interval  for  the  minimum  lifetime 

Suppose we have a random sample $X_1, \ldots, X_n$ from a shifted exponential
distribution, that is, $X_i = \delta + Y_i$, where $Y_1, \ldots, Y_n$ are a random sample from
an Exp(1) distribution. This type of random variable is sometimes used to
model lifetimes; a minimum lifetime is guaranteed, but otherwise the lifetime
has an exponential distribution. The unknown parameter $\delta$ represents the
minimum lifetime, and the probability density of the $X_i$ is positive only for
values greater than $\delta$.

To derive information about $\delta$ it is natural to use the smallest observed value
$T = \min\{X_1, \ldots, X_n\}$. This is also the maximum likelihood estimator for $\delta$;
see Exercise 21.6. Writing

$$T = \min\{\delta + Y_1, \ldots, \delta + Y_n\} = \delta + \min\{Y_1, \ldots, Y_n\}$$

and observing that $M = \min\{Y_1, \ldots, Y_n\}$ has an Exp(n) distribution (see
Exercise 8.18), we find for the distribution function of T: $F_T(a) = 0$ for $a < \delta$
and

$$F_T(a) = P(T \le a) = P(\delta + M \le a) = P(M \le a - \delta) = 1 - e^{-n(a-\delta)} \quad\text{for } a \ge \delta. \tag{24.1}$$


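Next, we solve

$$P(c_l < T < c_u) = 1 - \alpha$$

by requiring

$$P(T \le c_l) = P(T \ge c_u) = \tfrac12\alpha.$$

Using (24.1) we find the following equations:

$$1 - e^{-n(c_l - \delta)} = \tfrac12\alpha \quad\text{and}\quad e^{-n(c_u - \delta)} = \tfrac12\alpha,$$

whose solutions are

$$c_l = \delta - \frac{1}{n}\ln\left(1 - \tfrac12\alpha\right) \quad\text{and}\quad c_u = \delta - \frac{1}{n}\ln\left(\tfrac12\alpha\right).$$

Both $c_l$ and $c_u$ are values larger than $\delta$, because the logarithms are negative.
We have found that, whatever the value of $\delta$:

$$P\left(\delta - \frac{1}{n}\ln\left(1 - \tfrac12\alpha\right) < T < \delta - \frac{1}{n}\ln\left(\tfrac12\alpha\right)\right) = 1 - \alpha.$$

By rearranging the inequalities, we see this is equivalent to

$$P\left(T + \frac{1}{n}\ln\left(\tfrac12\alpha\right) < \delta < T + \frac{1}{n}\ln\left(1 - \tfrac12\alpha\right)\right) = 1 - \alpha,$$

and therefore a $100(1 - \alpha)\%$ confidence interval for $\delta$ is given by

$$\left(t + \frac{1}{n}\ln\left(\tfrac12\alpha\right),\; t + \frac{1}{n}\ln\left(1 - \tfrac12\alpha\right)\right). \tag{24.2}$$

For $\alpha = 0.05$ this becomes:

$$\left(t - \frac{3.69}{n},\; t - \frac{0.0253}{n}\right).$$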

Quick exercise 24.2 Suppose you have a dataset of size 15 from a shifted
Exp(1) distribution, whose minimum value is 23.5. What is the 99% confidence
interval for $\delta$?

Looking back at the example, we see that the confidence interval could be
constructed because we know that $T - \delta = M$ has an exponential distribution.
There are many more examples of this type: some function $g(T, \theta)$ of a sample
statistic T and the unknown parameter $\theta$ has a known distribution. However,
this still does not cover all the ways to construct confidence intervals (see also
the following remark).
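
As a sanity check on the minimum-lifetime interval (24.2), one can simulate the model and count how often the interval covers the true minimum lifetime. The sketch below is illustrative only: the function names, the true value 10.0, the sample size 20, the number of repetitions, and the seed are all arbitrary choices of ours.

```python
import math
import random

def min_lifetime_interval(t, n, alpha):
    """Interval (24.2) for the minimum lifetime, given the sample minimum t of n values."""
    return t + math.log(alpha / 2) / n, t + math.log(1 - alpha / 2) / n

def simulated_coverage(delta, n, alpha, reps, seed=1):
    """Fraction of simulated shifted-Exp(1) samples whose interval covers delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # T is the smallest of n observations delta + Y_i with Y_i ~ Exp(1).
        t = min(delta + rng.expovariate(1.0) for _ in range(n))
        lower, upper = min_lifetime_interval(t, n, alpha)
        hits += lower < delta < upper
    return hits / reps

print(min_lifetime_interval(t=10.0, n=20, alpha=0.05))
print(simulated_coverage(delta=10.0, n=20, alpha=0.05, reps=20000))
```

Since the derivation is exact rather than based on a large-sample approximation, the simulated fraction should be close to $1 - \alpha$ for any n, up to simulation noise.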


Remark 24.2 (About a general method). Suppose $X_1, \ldots, X_n$ is a
random sample from some distribution depending on some unknown
parameter $\theta$ and let T be a sample statistic. One possible choice is to select
a T that is an estimator for $\theta$, but this is not necessary. In each case, the
distribution of T depends on $\theta$, just as that of $X_1, \ldots, X_n$ does. In some
cases it might be possible to find functions $g(\theta)$ and $h(\theta)$ such that

$$P(g(\theta) < T < h(\theta)) = 1 - \alpha \quad\text{for every value of } \theta. \tag{24.3}$$

If this is so, then confidence statements about $\theta$ can be made. In more
special cases, for example if g and h are strictly increasing, the inequalities
$g(\theta) < T < h(\theta)$ can be rewritten as

$$h^{-1}(T) < \theta < g^{-1}(T),$$

and then (24.3) is equivalent to

$$P(h^{-1}(T) < \theta < g^{-1}(T)) = 1 - \alpha \quad\text{for every value of } \theta.$$

Checking with the confidence interval definition, we see that the last
statement implies that $(h^{-1}(t), g^{-1}(t))$ is a $100(1 - \alpha)\%$ confidence interval for $\theta$.
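
As an illustration (our own check, not part of the original remark), the minimum-lifetime example fits this scheme with $\theta = \delta$: the bounds $c_l$ and $c_u$ found there play the roles of $g(\delta)$ and $h(\delta)$, both are strictly increasing in $\delta$, and inverting them gives back interval (24.2).

```latex
\begin{align*}
g(\delta) &= \delta - \tfrac{1}{n}\ln\bigl(1 - \tfrac12\alpha\bigr), &
h(\delta) &= \delta - \tfrac{1}{n}\ln\bigl(\tfrac12\alpha\bigr), \\
g^{-1}(T) &= T + \tfrac{1}{n}\ln\bigl(1 - \tfrac12\alpha\bigr), &
h^{-1}(T) &= T + \tfrac{1}{n}\ln\bigl(\tfrac12\alpha\bigr), \\
\bigl(h^{-1}(t),\, g^{-1}(t)\bigr) &=
\Bigl(t + \tfrac{1}{n}\ln\bigl(\tfrac12\alpha\bigr),\;
      t + \tfrac{1}{n}\ln\bigl(1 - \tfrac12\alpha\bigr)\Bigr).
\end{align*}
```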


24.3  One-sided  confidence  intervals 

Suppose  you  are  in  charge  of  a  power  plant  that  generates  and  sells  electricity, 
and  you  are  about  to  buy  a  shipment  of  coal,  say  a  shipment  of  the  Daw  Mill 
coal  identified  as  258GB41  earlier.  You  plan  to  buy  the  shipment  if  you  are 
confident  that  the  gross  calorific  content  exceeds  31.00  MJ/kg.  At  the  end  of 
Section  23.2  we  obtained  for  the  gross  calorific  content  the  95%  confidence 
interval  (30.946,31.067):  based  on  the  data  we  are  95%  confident  that  the 
gross  calorific  content  is  higher  than  30.946  and  lower  than  31.067. 

In the present situation, however, we are only interested in the lower bound:
we would prefer a confidence statement of the type “we are 95% confident
that the gross calorific content exceeds 31.00.” Modifying equation (23.4) we
find

$$P\left(\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < t_{n-1,\alpha}\right) = 1 - \alpha,$$

which is equivalent to

$$P\left(\bar{X}_n - t_{n-1,\alpha}\frac{S_n}{\sqrt{n}} < \mu\right) = 1 - \alpha.$$

We conclude that

$$\left(\bar{x}_n - t_{n-1,\alpha}\frac{s_n}{\sqrt{n}},\; \infty\right)$$

is a $100(1 - \alpha)\%$ one-sided confidence interval for $\mu$. For the Daw Mill coal,
using $\alpha = 0.05$, with $t_{21,0.05} = 1.721$ this results in:

$$\left(31.012 - 1.721\,\frac{0.1294}{\sqrt{22}},\; \infty\right) \approx (30.96, \infty).$$

We see that because “all uncertainty may be put on one side,” the lower
bound in the one-sided interval is higher than that in the two-sided one,
though still below 31.00. Other situations may require a confidence upper
bound. For example, if the calorific value is below a certain number you can
try to negotiate a lower price.
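
The lower bound for the Daw Mill coal can be reproduced with a short sketch. The function name is our own, and the summary values used (sample mean 31.012, sample standard deviation 0.1294, n = 22, and the critical value $t_{21,0.05} = 1.721$) are the ones quoted in this chapter; the critical value is passed in rather than computed, since the Python standard library has no t-distribution quantile function.

```python
import math

def lower_confidence_bound(mean, sd, n, t_crit):
    """One-sided lower bound: mean - t_crit * sd / sqrt(n).

    t_crit is the critical value t_{n-1, alpha}, supplied by the caller
    (for example from a table of the t-distribution).
    """
    return mean - t_crit * sd / math.sqrt(n)

# Daw Mill coal, 95% one-sided interval (lower bound, infinity).
bound = lower_confidence_bound(mean=31.012, sd=0.1294, n=22, t_crit=1.721)
print(f"lower bound: {bound:.2f}")
print("exceeds 31.00?", bound > 31.00)
```

An upper confidence bound follows the same pattern with the sign of the correction term reversed.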

The definition of confidence intervals (page 343) can be extended to include
one-sided confidence intervals as well. If we have a sample statistic $L_n$ such
that

$$P(L_n < \theta) = \gamma$$

for every value of the parameter of interest $\theta$, then

$$(l_n, \infty)$$

is called a $100\gamma\%$ one-sided confidence interval for $\theta$. The number $l_n$ is
sometimes called a $100\gamma\%$ lower confidence bound for $\theta$. Similarly, $U_n$ with
$P(\theta < U_n) = \gamma$ for every value of $\theta$, yields the one-sided confidence interval
$(-\infty, u_n)$, and $u_n$ is called a $100\gamma\%$ upper confidence bound.

Quick  exercise  24.3  Determine  the  99%  upper  confidence  bound  for  the 
gross  calorific  value  of  the  Daw  Mill  coal. 


24.4  Determining  the  sample  size 

The narrower the confidence interval the better (why?). As a general
principle, we know that more accurate statements can be made if we have more
measurements. Sometimes, an accuracy requirement is set, even before data
are collected, and the corresponding sample size is to be determined. We
provide an example of how to do this and note that this generally can be done,
but the actual computation varies with the type of confidence interval.

Consider the question of the calorific content of coal once more. We have a
shipment of coal to test and we want to obtain a 95% confidence interval,
but it should not be wider than 0.05 MJ/kg, i.e., the lower and upper bound
should not differ more than 0.05. How many measurements do we need?

We answer this question for the case when ISO method 1928 is used, whence
we may assume that measurements are normally distributed with standard
deviation $\sigma = 0.1$. When the desired confidence level is $1 - \alpha$, the width of
the confidence interval will be

$$2 \cdot z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$

Requiring that this is at most w means finding the smallest n that satisfies

$$2 z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le w$$

or

$$n \ge \left(\frac{2 z_{\alpha/2}\,\sigma}{w}\right)^2.$$

For the example: w = 0.05, $\sigma = 0.1$, and $z_{0.025} = 1.96$; so

$$n \ge \left(\frac{2 \cdot 1.96 \cdot 0.1}{0.05}\right)^2 \approx 61.5,$$

that is, we should perform at least 62 measurements.
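
The sample-size rule can be written as a small helper. This is a sketch with names of our own choosing; it obtains the critical value from the standard normal quantile function in Python's `statistics` module instead of a table, so it uses z ≈ 1.95996 rather than the rounded 1.96.

```python
import math
from statistics import NormalDist

def required_sample_size(width, sigma, confidence):
    """Smallest n with 2 * z_{alpha/2} * sigma / sqrt(n) <= width (sigma known)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    return math.ceil((2 * z * sigma / width) ** 2)

# Coal example: width 0.05 MJ/kg, sigma = 0.1, 95% confidence.
print(required_sample_size(width=0.05, sigma=0.1, confidence=0.95))  # 62
```

Halving the desired width quadruples the required number of measurements, because n grows with the square of $\sigma / w$.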

In case $\sigma$ is unknown, we somehow have to estimate it, and then the method
can only give an indication of the required sample size. The standard deviation
as we (afterwards) estimate it from the data may turn out to be quite different,
and the obtained confidence interval may be smaller or larger than intended.


Quick  exercise  24.4  What  is  the  required  sample  size  if  we  want  the  99% 
confidence  interval  to  be  0.05  MJ/kg  wide? 


24.5  Solutions  to  the  quick  exercises 


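24.1 We need to solve

$$\left(\frac{80}{200} - p\right)^2 - \frac{(2)^2}{200}\,p(1-p) < 0, \quad\text{or}\quad 1.02p^2 - 0.82p + 0.16 < 0.$$

The solutions are:

$$p_{1,2} = \frac{-(-0.82) \pm \sqrt{(-0.82)^2 - 4 \cdot 1.02 \cdot 0.16}}{2 \cdot 1.02} = 0.40 \pm 0.07,$$

so the confidence interval is (0.33, 0.47).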

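24.2 We should substitute n = 15, t = 23.5, and $\alpha = 0.01$ into:

$$\left(t + \frac{1}{n}\ln\left(\tfrac12\alpha\right),\; t + \frac{1}{n}\ln\left(1 - \tfrac12\alpha\right)\right),$$

which yields

$$\left(23.5 - \frac{5.30}{15},\; 23.5 - \frac{0.0050}{15}\right) = (23.1467,\ 23.4997).$$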


24.3 The upper confidence bound is given by

$$u_n = \bar{x}_n + t_{21,0.01}\frac{s_n}{\sqrt{22}},$$

where $\bar{x}_n = 31.012$, $t_{21,0.01} = 2.518$, and $s_n = 0.1294$. Substitution yields
$u_n = 31.081$.

24.4 The confidence level changes to 99%, so we use $z_{0.005} = 2.576$ instead
of 1.96 in the computation:

$$n \ge \left(\frac{2 \cdot 2.576 \cdot 0.1}{0.05}\right)^2 \approx 106.2,$$

so we need at least 107 measurements.


24.6  Exercises 

24.1 □ Of a series of 100 (independent and identical) chemical experiments,
70 were concluded successfully. Construct a 90% confidence interval for the
success probability of this type of experiment.

24.2  In  January  2002  the  Euro  was  introduced  and  soon  after  stories  started 
to  circulate  that  some  of  the  Euro  coins  would  not  be  fair  coins,  because  the 
“national  side”  of  some  coins  would  be  too  heavy  or  too  light  (see,  for  example, 
the  New  Scientist  of  January  4,  2002,  but  also  national  newspapers  of  that 
date). 

a. A French 1 Euro coin was tossed six times, resulting in 1 heads and 5 tails.
Is it reasonable to use the Wilson method, introduced in Section 24.1, to
construct a confidence interval for p?

b.  A  Belgian  1  Euro  coin  was  tossed  250  times:  140  heads  and  110  tails. 
Construct  a  95%  confidence  interval  for  the  probability  of  getting  heads 
with  this  coin. 

24.3 In Exercise 23.1, what sample size is needed if we want a 99% confidence
interval for $\mu$ at most 1 ml wide?

24.4  □  Recall  Exercise  23.3  and  the  10  bags  of  cement  that  should  each  weigh 
94  kg.  The  average  weight  was  93.5  kg,  with  sample  standard  deviation  0.75. 

a.  Based  on  these  data,  how  many  bags  would  you  need  to  sample  to  make 
a  90%  confidence  interval  that  is  0.1  kg  wide? 

b. Suppose you actually do measure the required number of bags and
construct a new confidence interval. Is it guaranteed to be at most 0.1 kg
wide?

24.5 Suppose we want to make a 95% confidence interval for the probability
of getting heads with a Dutch 1 Euro coin, and it should be at most 0.01
wide. To determine the required sample size, we note that the probability of
getting heads is about 0.5. Furthermore, if X has a Bin(n,p) distribution,
with n large and $p \approx 0.5$, then

$$\frac{X - np}{\sqrt{n/4}}$$

is approximately standard normal.

a. Use this statement to derive that the width of the 95% confidence interval
for p is approximately

$$\frac{z_{0.025}}{\sqrt{n}}.$$

Use this width to determine how large n should be.

b. The coin is thrown the number of times just computed, resulting in 19477
times heads. Construct the 95% confidence interval and check whether the
required accuracy is attained.


24.6 ⊞ Environmentalists have taken 16 samples from the wastewater of a
chemical plant and measured the concentration of a certain carcinogenic
substance. They found $\bar{x}_{16} = 2.24$ (ppm) and $s_{16}^2 = 1.12$, and want to use these
data in a lawsuit against the plant. It may be assumed that the data are a
realization of a normal random sample.

a. Construct the 97.5% one-sided confidence interval that the
environmentalists made to convince the judge that the concentration exceeds legal
limits.

b. The plant management uses the same data to construct a 97.5%
one-sided confidence interval to show that concentrations are not too high.
Construct this interval as well.


24.7 Consider once more the Rutherford-Geiger data as given in Section 23.4.
Knowing that the number of $\alpha$-particle emissions during an interval has a
Poisson distribution, we may see the data as observations from a $Pois(\mu)$
distribution. The central limit theorem tells us that the average of a large
number of independent $Pois(\mu)$ random variables approximately has a normal
distribution and hence that

$$\frac{\bar{X}_n - \mu}{\sqrt{\mu}/\sqrt{n}}$$

has a distribution that is approximately $N(0, 1)$.

a. Show that the large sample 95% confidence interval contains those values
of $\mu$ for which

$$(\bar{x}_n - \mu)^2 \le (1.96)^2\,\frac{\mu}{n}.$$

b. Use the result from a to construct the large sample 95% confidence interval
based on the Rutherford-Geiger data.

c. Compare the result with that of Exercise 23.9 b. Is this surprising?

24.8 □ Recall Exercise 23.5 about the 1500 m speed-skating results in the 2002
Winter Olympic Games. If there were no outer lane advantage, the number
out of the 23 completed races won by skaters starting in the outer lane would
have a Bin(23,p) distribution with p = 1/2, because of the lane assignment
by lottery.

a. Of the 23 races, 15 were won by the skater starting in the outer lane. Use
this information to construct a 95% confidence interval for p by means
of the Wilson method. If you think that n = 23 is probably too small to
use a method based on the central limit theorem, we agree. We should be
careful with conclusions we draw from this confidence interval.

b. The question posed earlier “Is there an outer lane advantage?” implies that
a one-sided confidence interval is more suitable. Construct the appropriate
95% one-sided confidence interval for p by first constructing a 90%
two-sided confidence interval.

24.9 ⊞ Suppose we have a dataset $x_1, \ldots, x_{12}$ that may be modeled as the
realization of a random sample $X_1, \ldots, X_{12}$ from a $U(0, \theta)$ distribution, with
$\theta$ unknown. Let $M = \max\{X_1, \ldots, X_{12}\}$.

a. Show that for $0 \le t \le 1$

$$P\left(\frac{M}{\theta} \le t\right) = t^{12}.$$

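b. Use $\alpha = 0.1$ and solve

$$P\left(\frac{M}{\theta} \le c_l\right) = P\left(\frac{M}{\theta} \ge c_u\right) = \tfrac12\alpha.$$

c. Suppose the realization of M is m = 3. Construct the 90% confidence
interval for $\theta$.

d. Derive the general expression for a confidence interval of level $1 - \alpha$ based
on a sample of size n.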

24.10 Suppose we have a dataset $x_1, \ldots, x_n$ that may be modeled as the
realization of a random sample $X_1, \ldots, X_n$ from an $Exp(\lambda)$ distribution, where
$\lambda$ is unknown. Let $S_n = X_1 + \cdots + X_n$.

a. Check that $\lambda S_n$ has a Gam(n, 1) distribution.

b. The following quantiles of the Gam(20, 1) distribution are given: $q_{0.05} =
13.25$ and $q_{0.95} = 27.88$. Use these to construct a 90% confidence interval
for $\lambda$ when n = 20.


25


Testing hypotheses: essentials


The statistical methods that we have discussed until now have been
developed to infer knowledge about certain features of the model distribution that
represent our quantities of interest. These inferences often take the form of
numerical estimates, as either single numbers or confidence intervals.
However, sometimes the conclusion to be drawn is not expressed numerically, but
is concerned with choosing between two conflicting theories, or hypotheses.
For instance, one has to assess whether the lifetime of a certain type of ball
bearing deviates or does not deviate from the lifetime guaranteed by the
manufacturer of the bearings; an engineer wants to know whether dry drilling is
faster or the same as wet drilling; a gynecologist wants to find out whether
smoking affects or does not affect the probability of getting pregnant; the
Allied Forces want to know whether the German war production is equal to or
smaller than what Allied intelligence agencies reported. The process of
formulating the possible conclusions one can draw from an experiment and choosing
between two alternatives is known as hypothesis testing. In this chapter we
start to explore this statistical methodology.


25.1  Null  hypothesis  and  test  statistic 

We will introduce the basic concepts of hypothesis testing with an
example. Let us return to the analysis of German war equipment. During World
War II the Allied Forces received reports by the Allied intelligence agencies
on German war production. The numbers of produced tires, tanks, and other
equipment, as claimed in these reports, were a lot higher than indicated by
the observed serial numbers. The objective was to decide whether the actual
produced quantities were smaller than the ones reported.

For simplicity suppose that we have observed tanks with (recoded) serial
numbers

61 19 56 24 16.

Furthermore, suppose that the Allied intelligence agencies report a production
of 350 tanks.¹ This is a lot more than we would surmise from the observed
data. We want to choose between the proposition that the total number of
tanks is 350 and the proposition that the total number is smaller than 350.
The two competing propositions are called null hypothesis, denoted by $H_0$, and
alternative hypothesis, denoted by $H_1$. The way we go about choosing between
$H_0$ and $H_1$ is conceptually similar to the way a jury deliberates in a court
trial. The null hypothesis corresponds to the position of the defendant: just
as he is presumed to be innocent until proven guilty, so is the null hypothesis
presumed to be true until the data provide convincing evidence against it.
The alternative hypothesis corresponds to the charges brought against the
defendant.

To decide whether $H_0$ is false we use a statistical model. As argued in
Chapter 20 the (recoded) serial numbers are modeled as a realization of random
variables $X_1, X_2, \ldots, X_5$ representing five draws without replacement from the
numbers $1, 2, \ldots, N$. The parameter N represents the total number of tanks.
The two hypotheses in question are

$$H_0 : N = 350$$
$$H_1 : N < 350.$$

If we reject the null hypothesis we will accept $H_1$; we speak of rejecting $H_0$
in favor of $H_1$. Usually, the alternative hypothesis represents the theory or
belief that we would like to accept if we do reject $H_0$. This means that we
must carefully choose $H_1$ in relation to our interests in the problem at hand.
In our example we are particularly interested in whether the number of tanks
is less than 350; so we test the null hypothesis against $H_1 : N < 350$. If we
were interested in whether the number of tanks differs from 350, or is
greater than 350, we would test against $H_1 : N \neq 350$ or $H_1 : N > 350$.

Quick exercise 25.1 In the drilling example from Sections 15.5 and 16.4 the
data on drill times for dry drilling are modeled as a realization of a random
sample from a distribution with expectation $\mu_1$, and similarly the data for wet
drilling correspond to a distribution with expectation $\mu_2$. We want to know
whether dry drilling is faster than wet drilling. To this end we test the null
hypothesis $H_0 : \mu_1 = \mu_2$ (the drill time is the same for both methods). What
would you choose for $H_1$?

The next step is to select a criterion based on $X_1, X_2, \ldots, X_5$ that provides an
indication about whether $H_0$ is false. Such a criterion involves a test statistic.

¹ This may seem ridiculous. However, when official German production
statistics became available after the war, the average monthly production of tanks during
the period 1940-1943 was 342. During the war this number was estimated at 327,
whereas Allied intelligence reported 1550! (see [27]).

Test statistic. Suppose the dataset is modeled as the realization
of random variables $X_1, X_2, \ldots, X_n$. A test statistic is any sample
statistic $T = h(X_1, X_2, \ldots, X_n)$ whose numerical value is used to
decide whether we reject $H_0$.


In  the  tank  example  we  use  the  test  statistic 

$T = \max\{X_1, X_2, \ldots, X_5\}.$

Having chosen a test statistic $T$, we investigate what sort of values $T$ can
attain. These values can be viewed on a credibility scale for $H_0$, and we must
determine which of these values provide evidence in favor of $H_0$, and which
provide evidence in favor of $H_1$. First of all note that if we find a value of
$T$ larger than 350, we immediately know that $H_0$ as well as $H_1$ is false. If
this happens, we actually should be considering another testing problem, but
for the current problem of testing $H_0 : N = 350$ against $H_1 : N < 350$ such
values are irrelevant. Hence the possible values of $T$ that are of interest to us
are the integers from 5 to 350.

If $H_0$ is true, then what is a typical value for $T$ and what is not? Remember
from Section 20.1 that, because $n = 5$, the expectation of $T$ is $E[T] = \frac{5}{6}(N + 1)$.
This means that the distribution of $T$ is centered around $\frac{5}{6}(N + 1)$. Hence, if
$H_0$ is true, then typical values of $T$ are in the neighborhood of $\frac{5}{6} \cdot 351 = 292.5$.
Values of $T$ that deviate a lot from 292.5 are evidence against $H_0$. Values that
are much greater than 292.5 are evidence against $H_0$ but provide even stronger
evidence against $H_1$. For such values we will not reject $H_0$ in favor of $H_1$. Also
values a little smaller than 292.5 are grounds not to reject $H_0$, because we are
committed to giving $H_0$ the benefit of the doubt. On the other hand, values
of $T$ very close to 5 should be considered as strong evidence against the null
hypothesis and are in favor of $H_1$, hence they lead to a decision to reject $H_0$.
This is summarized in Figure 25.1.

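
[Figure: the possible values of $T$ on a line from 5 to 350, with 292.5 marked.
From left to right: values in favor of $H_1$ (toward 5), values in favor of $H_0$
(around 292.5), and values against both $H_0$ and $H_1$ (toward 350).]

Fig. 25.1. Values of the test statistic $T$.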
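
The claim that $T$ is centered around $\frac{5}{6}(N + 1)$ can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes the model stated above (five draws without replacement from $1, 2, \ldots, 350$), and the function name `max_pmf` is purely illustrative.

```python
from fractions import Fraction
from math import comb

def max_pmf(k, N=350, n=5):
    """P(T = k) for T the maximum of n draws without replacement from 1..N."""
    # T = k means that k itself is drawn and the other n - 1 numbers
    # are chosen from 1..k-1.
    return Fraction(comb(k - 1, n - 1), comb(N, n))

N, n = 350, 5
total = sum(max_pmf(k, N, n) for k in range(n, N + 1))
expectation = sum(k * max_pmf(k, N, n) for k in range(n, N + 1))

print(total)               # 1
print(float(expectation))  # 292.5
```

Exact fractions are used so that the expectation comes out as exactly 292.5 rather than as a rounded floating-point value.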


Quick exercise 25.2 Another possible test statistic would be $\bar{X}_5$. If we use
its values as a credibility scale for $H_0$, then what are the possible values of
$\bar{X}_5$, which values of $\bar{X}_5$ are in favor of $H_1 : N < 350$, and which values are in
favor of $H_0 : N = 350$?



For  the  data  we  find 


t  =  max{61, 19,  56, 24, 16}  =  61 

as the realization of the test statistic. How do we use this to decide on $H_0$?


25.2  Tail  probabilities 


As we have just seen, if $H_0$ is true, then typical values of $T$ are in the
neighborhood of $\frac{5}{6} \cdot 351 = 292.5$. In view of Figure 25.1, the more a value of $T$ is to the
left, the stronger evidence it provides in favor of $H_1$. The value 61 is in the left
region of Figure 25.1. Can we now reject $H_0$ and conclude that $N$ is smaller
than 350, or can the fact that we observe 61 as maximum be attributed to
chance? In courtroom terminology: can we reach the conclusion that the null
hypothesis is false beyond reasonable doubt? One way to investigate this is to
examine how likely it is that one would observe a value of $T$ that provides
even stronger evidence against $H_0$ than 61, in the situation that $N = 350$. If
this is very unlikely, then 61 already bears strong evidence against $H_0$.

Values of $T$ that provide stronger evidence against $H_0$ than 61 are to the
left of 61. Therefore we compute $P(T \le 61)$. In the situation that $N = 350$,
the test statistic $T$ is the maximum of 5 numbers drawn without replacement
from $1, 2, \ldots, 350$. We find that

$$P(T \le 61) = P(\max\{X_1, X_2, \ldots, X_5\} \le 61) = \frac{61}{350} \cdot \frac{60}{349} \cdots \frac{57}{346} = 0.00014.$$


This probability is so small that we view the value 61 as strong evidence
against the null hypothesis. Indeed, if the null hypothesis were true, then
values of $T$ that would provide the same or even stronger evidence against $H_0$
than 61 are very unlikely to occur, i.e., they occur with probability 0.00014!
In other words, the observed value 61 is exceptionally small in case $H_0$ is true.
At this point we can do two things: either we believe that $H_0$ is true and
that something very unlikely has happened, or we believe that events with
such a small probability do not happen in practice, so that $T \le 61$ could
only have occurred because $H_0$ is false. We choose to believe that things
happening with probability 0.00014 are so exceptional that we reject the null
hypothesis $H_0 : N = 350$ in favor of the alternative hypothesis $H_1 : N < 350$.
In courtroom terminology: we say that a value of $T$ smaller than or equal to
61 implies that the null hypothesis is false beyond reasonable doubt.
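
The left tail probability used here is a product of five ratios, which is easy to evaluate by machine. The sketch below is an illustration only, assuming the same model of draws without replacement; the helper name `left_tail_max` is hypothetical and not taken from the text.

```python
def left_tail_max(t, N, n=5):
    """P(T <= t) for the maximum T of n draws without replacement from 1..N."""
    if t < n:
        return 0.0  # n distinct numbers cannot all be at most t
    prob = 1.0
    for i in range(n):
        # After i draws at most t, the next draw must land among the
        # t - i remaining small numbers out of N - i remaining numbers.
        prob *= (t - i) / (N - i)
    return prob

p_value = left_tail_max(61, N=350)
print(round(p_value, 5))  # 0.00014
```

Calling the same function with other values of `t` or `N` gives the corresponding tail probability under a different observed maximum or a different null hypothesis.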


P-values 

In our example, the more a value of $T$ is to the left, the stronger evidence
it provides against $H_0$. For this reason we computed the left tail probability
$P(T \le 61)$. In other situations, the direction in which values of $T$ provide
stronger evidence against $H_0$ may be to the right of the observed value $t$,
in which case one would compute a right tail probability $P(T \ge t)$. In both
cases the tail probability expresses how likely it is to obtain a value of the
test statistic $T$ at least as extreme as the value $t$ observed for the data. Such
a probability is called a p-value. In a way, the size of the p-value reflects how
much evidence the observed value $t$ provides against $H_0$. The smaller the
p-value, the stronger evidence the observed value $t$ bears against $H_0$.

The phrase "at least as extreme as the observed value $t$" refers to a particular
direction, namely the direction in which values of $T$ provide stronger evidence
against $H_0$ and in favor of $H_1$. In our example, this was to the left of 61, and
the p-value corresponding to 61 was $P(T \le 61) = 0.00014$. In this case it is
clear what is meant by "at least as extreme as $t$" and which tail probability
corresponds to the p-value. However, in some testing problems one can deviate
from $H_0$ in both directions. In such cases it may not be clear what values of
$T$ are at least as extreme as the observed value, and it may be unclear how
the p-value should be computed. One approach to a solution in this case is
to simply compute the one-tailed p-value that corresponds to the direction in
which $t$ deviates from $H_0$.

Quick exercise 25.3 Suppose that the Allied intelligence agencies had
reported a production of 80 tanks, so that we would test $H_0 : N = 80$ against
$H_1 : N < 80$. Compute the p-value corresponding to 61. Would you conclude
$H_0$ is false beyond reasonable doubt?

Suppose that the maximum is 200 instead of 61. This is also to the left of
the expected value 292.5 of $T$. Is it far enough to the left to reject the null
hypothesis? In this case the p-value is equal to


$$P(T \le 200) = P(\max\{X_1, X_2, \ldots, X_5\} \le 200) = \frac{200}{350} \cdot \frac{199}{349} \cdots \frac{196}{346} = 0.0596.$$


This  means  that  if  the  total  number  of  produced  tanks  is  350,  then  in  5.96% 
of  all  cases  we  would  observe  a  value  of  T  that  is  at  least  as  extreme  as  the 
value  200.  Before  we  decide  whether  0.0596  is  small  enough  to  reject  the  null 
hypothesis  let  us  explore  in  more  detail  what  the  preceding  probability  stands 
for. 


It is important to distinguish between (1) the true state of nature: $H_0$ is true
or $H_1$ is true and (2) our decision: we reject or do not reject $H_0$ on the basis
of the data. In our example the possibilities for the true state of nature are:

•  $H_0$ is true, i.e., there are 350 tanks produced.

•  $H_1$ is true, i.e., the number of tanks produced is less than 350.

We do not know in which situation we are. There are two possible decisions:

•  We reject $H_0$ in favor of $H_1$.

•  We do not reject $H_0$.

This leads to four possible situations, which are summarized in Figure 25.2.


Our decision on the        True state of nature
basis of the data          $H_0$ is true       $H_1$ is true

Reject $H_0$               Type I error        Correct decision
Not reject $H_0$           Correct decision    Type II error

Fig. 25.2. Four situations when deciding about $H_0$.


There are two situations in which the decision made on the basis of the data is
wrong. The null hypothesis $H_0$ may be true, whereas the data lead to rejection
of $H_0$. On the other hand, the alternative hypothesis $H_1$ may be true, whereas
we do not reject $H_0$ on the basis of the data. These wrong decisions are called
type I and type II errors.


Type I and II errors. A type I error occurs if we falsely reject
$H_0$. A type II error occurs if we falsely do not reject $H_0$.


In  courtroom  terminology,  a  type  I  error  corresponds  to  convicting  an  innocent 
defendant,  whereas  a  type  II  error  corresponds  to  acquitting  a  criminal. 

If $H_0 : N = 350$ is true, then the decision to reject $H_0$ is a type I error. We
will never know whether we make a type I error. However, given a particular
decision rule, we can say something about the probability of committing a
type I error. Suppose the decision rule would be "reject $H_0 : N = 350$
whenever $T \le 200$." With this decision rule the probability of committing a type I
error is $P(T \le 200) = 0.0596$. If we are willing to run the risk of committing
a type I error with probability 0.0596, we could adopt this decision rule. This
would also mean that on the basis of an observed maximum of 200 we would
reject $H_0$ in favor of $H_1 : N < 350$.
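
The probability of a type I error attached to a decision rule has a frequency interpretation that can be illustrated by simulation. The sketch below is an assumption-laden illustration, not the book's method: it repeatedly draws five serial numbers under $H_0 : N = 350$, applies the rule "reject whenever $T \le c$", and compares the rejection frequency with the exact probability written as a ratio of binomial coefficients. The function names, the number of runs, and the seed are arbitrary choices.

```python
import random
from math import comb

def type_one_error_exact(c, N=350, n=5):
    """P(T <= c) when N is the true number: all n draws fall in 1..c."""
    return comb(c, n) / comb(N, n)

def type_one_error_simulated(c, N=350, n=5, runs=100_000, seed=1):
    """Fraction of samples simulated under H0 for which 'T <= c' rejects."""
    rng = random.Random(seed)
    population = range(1, N + 1)
    rejections = sum(max(rng.sample(population, n)) <= c for _ in range(runs))
    return rejections / runs

exact = type_one_error_exact(200)
approx = type_one_error_simulated(200)

print(round(exact, 4))  # 0.0596
print(approx)           # a simulated frequency close to the exact value
```

The simulated frequency varies with the seed and the number of runs; only the exact value is deterministic.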

Quick exercise 25.4 Suppose we adopt the following decision rule about the
null hypothesis: "reject $H_0 : N = 350$ whenever $T \le 250$." Using this decision
rule, what is the probability of committing a type I error?

The question remains what amount of risk one is willing to take to falsely
reject $H_0$, or in courtroom terminology: how small should the p-value be to
reach a conclusion that is "beyond reasonable doubt"? In many situations,
as a rule of thumb 0.05 is used as the level where reasonable doubt begins.
Something happening with probability less than or equal to 0.05 is then viewed
as being too exceptional. However, there is no general rule that specifies how
small the p-value must be to reject $H_0$. There is no way to argue that this
probability should be below 0.10 or 0.18 or 0.009, or anything else.

A possible solution is to solely report the p-value corresponding to the
observed value of the test statistic. This is objective and does not have the
arbitrariness  of  a  preselected  level  such  as  0.05.  An  investigator  who  reports 
the  p-value  conveys  the  maximum  amount  of  information  contained  in  the 
dataset  and  permits  all  decision  makers  to  choose  their  own  level  and  make 
their  own  decision  about  the  null  hypothesis.  This  is  especially  important 
when  there  is  no  justifiable  reason  for  preselecting  a  particular  value  for  such 
a  level. 


25.4  Solutions  to  the  quick  exercises 

25.1 One is interested in whether dry drilling is faster than wet drilling.
Hence if we reject $H_0 : \mu_1 = \mu_2$, we would like to conclude that the drill time
is smaller for dry drilling than for wet drilling. Since $\mu_1$ and $\mu_2$ represent the
drill time for dry and wet drilling, we should choose $H_1 : \mu_1 < \mu_2$.

25.2 The value of $\bar{X}_5$ is at least 3 and if we find a value of $\bar{X}_5$ that is larger
than 348, then at least one of the five numbers must be greater than 350, so
that we immediately know that $H_0$ as well as $H_1$ is false. Hence the possible
values of $\bar{X}_5$ that are relevant for our testing problem are between 3 and 348.
We know from Section 20.1 that $2\bar{X}_5 - 1$ is an unbiased estimator for $N$,
no matter what the value of $N$ is. This implies that values of $\bar{X}_5$ itself are
centered around $(N + 1)/2$. Hence values close to $351/2 = 175.5$ are in favor
of $H_0$, whereas values close to 3 are in favor of $H_1$. Values close to 348 are
against $H_0$, but also against $H_1$. See Figure 25.3.

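[Figure: the possible values of $\bar{X}_5$ on a line from 3 to 348, with 175.5 marked.
From left to right: values in favor of $H_1$ (toward 3), values in favor of $H_0$
(around 175.5), and values against both $H_0$ and $H_1$ (toward 348).]

Fig. 25.3. Values of the test statistic $\bar{X}_5$.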


25.3 The p-value corresponding to 61 is now equal to
$$P(T \le 61) = \frac{61}{80} \cdot \frac{60}{79} \cdots \frac{57}{76} = 0.2475.$$




If $H_0$ is true, then 24.75% of the time one will observe a value of $T$ less than
or equal to 61. Such values are not exceptionally small for $T$ under $H_0$, and
therefore the evidence that the value 61 bears against $H_0$ is pretty weak. We
cannot reject $H_0$ beyond reasonable doubt.

25.4 The type I error associated with the decision rule occurs if $N = 350$
($H_0$ is true) and $t \le 250$ (reject $H_0$). The probability that this happens is
$$P(T \le 250) = \frac{250}{350} \cdot \frac{249}{349} \cdots \frac{246}{346} = 0.1838.$$


25.5  Exercises 

25.1 In a study about train delays in The Netherlands one was interested in
whether arrival delays of trains exhibit more variation during rush hours than
during quiet hours. The observed arrival delays during rush hours are
modeled as realizations of a random sample from a distribution with variance $\sigma_1^2$,
and similarly the observed arrival delays during quiet hours correspond to a
distribution with variance $\sigma_2^2$. One tests the null hypothesis $H_0 : \sigma_1 = \sigma_2$.
What do you choose as the alternative hypothesis?

25.2 □ On average, the number of babies born in Cleveland, Ohio, in the
month of September is 1472. On January 26, 1977, the city was immobilized
by a blizzard. Nine months later, in September 1977, the recorded number of
births was 1718. Can the increase of 246 be attributed to chance? To
investigate this, the number of births in the month of September is modeled by a
Poisson random variable with parameter $\mu$, and we test $H_0 : \mu = 1472$. What
would you choose as the alternative hypothesis?

25.3 Recall Exercise 17.9 about black cherry trees. The scatterplot of $y$
(volume) versus $x = d^2 h$ (squared diameter times height) seems to indicate that
the regression line $y = \alpha + \beta x$ runs through the origin. One wants to
investigate whether this is true by means of a testing problem. Formulate a null
hypothesis and alternative hypothesis in terms of (one of) the parameters $\alpha$
and $\beta$.

25.4 ⊞ Consider the example from Section 4.4 about the number of cycles
up to pregnancy of smoking and nonsmoking women. Suppose the observed
number of cycles are modeled as realizations of random samples from
geometric distributions. Let $p_1$ be the parameter of the geometric distribution
corresponding to smoking women and $p_2$ be the parameter for the
nonsmoking women. We are interested in whether $p_1$ is different from $p_2$, and we
investigate this by testing $H_0 : p_1 = p_2$ against $H_1 : p_1 \neq p_2$.

a. If the data are as given in Exercise 17.5, what would you choose as a test
statistic?

b. What would you choose as a test statistic, if you were given the extra
knowledge as in Table 21.1?

c. Suppose we are interested in whether smoking women are less likely to get
pregnant than nonsmoking women. What is the appropriate alternative
hypothesis in this case?

25.5 □ Suppose a dataset is a realization of a random sample $X_1, X_2, \ldots, X_n$
from a uniform distribution on $[0, \theta]$, for some (unknown) $\theta > 0$. We test
$H_0 : \theta = 5$ versus $H_1 : \theta \neq 5$.

a. We take $T_1 = \max\{X_1, X_2, \ldots, X_n\}$ as our test statistic. Specify what
the (relevant) possible values are for $T_1$ and which are in favor of $H_0$ and
which are in favor of $H_1$. For instance, make a picture like Figure 25.1.

b. Same as a, but now for test statistic $T_2 = |2\bar{X}_n - 5|$.

25.6 □ To test a certain null hypothesis $H_0$ one uses a test statistic $T$ with
a continuous sampling distribution. One agrees that $H_0$ is rejected if one
observes a value $t$ of the test statistic for which (under $H_0$) the right tail
probability $P(T \ge t)$ is smaller than or equal to 0.05. Given below are different
values $t$ and a corresponding left or right tail probability (under $H_0$). Specify
for each case what the p-value is, if possible, and whether we should reject $H_0$.


a. $t = 2.34$ and $P(T \ge 2.34) = 0.23$.
b. $t = 2.34$ and $P(T \le 2.34) = 0.23$.
c. $t = 0.03$ and $P(T \ge 0.03) = 0.968$.
d. $t = 1.07$ and $P(T \le 1.07) = 0.981$.
e. $t = 1.07$ and $P(T \le 2.34) = 0.01$.
f. $t = 2.34$ and $P(T \le 1.07) = 0.981$.
g. $t = 2.34$ and $P(T \le 1.07) = 0.800$.

25.7 (Exercise 25.2 continued). The number of births in September is
modeled by a Poisson random variable $T$ with parameter $\mu$, which represents the
expected number of births. Suppose that one uses $T$ to test the null
hypothesis $H_0 : \mu = 1472$ and that one decides to reject $H_0$ on the basis of observing
the value $t = 1718$.

a. In which direction do values of $T$ provide evidence against $H_0$ (and in
favor of $H_1$)?

b. Compute the p-value corresponding to $t = 1718$, where you may use the
fact that the distribution of $T$ can be approximated by an $N(\mu, \mu)$
distribution.

25.8 Suppose we want to test the null hypothesis that our dataset is a
realization of a random sample from a standard normal distribution. As test statistic
we use the Kolmogorov-Smirnov distance between the empirical distribution
function $F_n$ of the data and the distribution function $\Phi$ of the standard
normal:

$$T = \sup_{a \in \mathbb{R}} |F_n(a) - \Phi(a)|.$$

What are the possible values of $T$ and in which direction do values of $T$ deviate
from the null hypothesis?

25.9 Recall the example from Section 18.3, where we investigated whether the
software data are exponential by means of the Kolmogorov-Smirnov distance
between the empirical distribution function $F_n$ of the data and the estimated
exponential distribution function:

$$T_{ks} = \sup_{a \in \mathbb{R}} \left| F_n(a) - \left(1 - e^{-\hat{\Lambda} a}\right) \right|.$$

For the data we found $t_{ks} = 0.176$. By means of a new parametric bootstrap
we simulated 100 000 realizations of $T_{ks}$ and found that all of them are smaller
than 0.176. What can you say about the p-value corresponding to 0.176?

25.10 ⊞ Consider the coal data from Table 23.1, where 23 gross calorific value
measurements are listed for Osterfeld coal coded 262DE27. We modeled this
dataset as a realization of a random sample from a normal distribution with
expectation $\mu$ unknown and standard deviation 0.1 MJ/kg. We are planning
to buy a shipment if the gross calorific value exceeds 23.75 MJ/kg. In order
to decide whether this is sensible, we test the null hypothesis $H_0 : \mu = 23.75$
with test statistic $T = \bar{X}_n$.

a. What would you choose as the alternative hypothesis?

b. For the dataset $\bar{x}_n$ is 23.788. Compute the corresponding p-value, using
that $\bar{X}_n$ has an $N(23.75, (0.1)^2/23)$ distribution under the null hypothesis.

25.11 ⊞ One is given a number $t$, which is the realization of a random
variable $T$ with an $N(\mu, 1)$ distribution. To test $H_0 : \mu = 0$ against $H_1 : \mu \neq 0$,
one uses $T$ as the test statistic. One decides to reject $H_0$ in favor of $H_1$ if
$|t| \ge 2$. Compute the probability of committing a type I error.


26

In the previous chapter we introduced the setup for testing a null hypothesis
against an alternative hypothesis using a test statistic $T$. The notions of type I
error and type II error were introduced. A type I error occurs when we falsely
reject $H_0$ on the basis of the observed value of $T$, whereas a type II error
occurs when we falsely do not reject $H_0$. The decision to reject $H_0$ or not was
based on the size of the p-value. In this chapter we continue the introduction
of basic concepts of testing hypotheses, such as significance level and critical
region, and investigate the probability of committing a type II error.


26.1  Significance  level 

As mentioned in the previous chapter, there is no general rule that specifies a
level below which the p-value is considered exceptionally small. However, there
are situations where this level is set a priori, and the question is: which values
of the test statistic should then lead to rejection of $H_0$? To illustrate this,
consider the following example. The speed limit on freeways in The Netherlands
is 120 kilometers per hour. A device next to freeway A2 between Amsterdam
and Utrecht measures the speed of passing vehicles. Suppose that the device
is designed in such a way that it conducts three measurements of the speed
of a passing vehicle, modeled by a random sample $X_1, X_2, X_3$. On the basis
of the value of the average $\bar{X}_3$, the driver is either fined for speeding or not.
For what values of $\bar{X}_3$ should we fine the driver, if we allow that 5% of the
drivers are fined unjustly?

Let  us  rephrase  things  in  terms  of  a  testing  problem.  Each  measurement  can 
be  thought  of  as 

measurement = true speed + measurement error.

Suppose for the moment that the measuring device is carefully calibrated, so
that the measurement error is modeled by a random variable with mean zero
and known variance $\sigma^2$, say $\sigma^2 = 4$. Moreover, in physical experiments such as
this one, the measurement error is often modeled by a random variable with a
normal distribution. In that case, the measurements $X_1, X_2, X_3$ are modeled
by a random sample from an $N(\mu, 4)$ distribution, where the parameter $\mu$
represents the true speed of the passing vehicle. Our testing problem can now
be formulated as testing

$H_0 : \mu = 120$ against $H_1 : \mu > 120$,

with test statistic

$$T = \frac{X_1 + X_2 + X_3}{3} = \bar{X}_3.$$

Since sums of independent normal random variables again have a normal
distribution (see Remark 11.2), it follows that $\bar{X}_3$ has an $N(\mu, 4/3)$ distribution.
In particular, the distribution of $T = \bar{X}_3$ is centered around $\mu$ no matter what
the value of $\mu$ is. Values of $T$ close to 120 are therefore in favor of $H_0$. Values of
$T$ that are far from 120 are considered as strong evidence against $H_0$. Values
much larger than 120 suggest that $\mu > 120$ and are therefore in favor of $H_1$.
Values much smaller than 120 suggest that $\mu < 120$. They also constitute
evidence against $H_0$, but even stronger evidence against $H_1$. Thus we reject
$H_0$ in favor of $H_1$ only for values of $T$ larger than 120. See also Figure 26.1.


[Figure: a number line for the values of $T$ with 120 marked; the values in
favor of $H_1$ lie to the right of 120.]

Fig. 26.1. Possible values of $T = \bar{X}_3$.
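
The statement that the average of three measurements has an $N(\mu, 4/3)$ distribution can be made plausible by simulation. The following is a rough sketch under the stated model (normal measurement errors with variance 4, true speed taken equal to 120 for illustration); the seed and the number of simulated vehicles are arbitrary assumptions.

```python
import random
from statistics import fmean

rng = random.Random(0)
mu, sigma = 120.0, 2.0  # true speed assumed to be 120; sigma^2 = 4

# Each simulated value of T is the average of three measurements
# of the same vehicle.
averages = [fmean(rng.gauss(mu, sigma) for _ in range(3))
            for _ in range(100_000)]

mean_T = fmean(averages)
var_T = fmean((a - mean_T) ** 2 for a in averages)

print(mean_T, var_T)  # close to 120 and to 4/3
```

The simulation only checks the center and spread of $T$; the normality of $T$ itself follows from the argument about sums of independent normal random variables.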


Rejection of $H_0$ in favor of $H_1$ corresponds to fining the driver for speeding.
Unjustly fining a driver corresponds to falsely rejecting $H_0$, i.e., committing
a type I error. Since we allow 5% of the drivers to be fined unjustly, we are
dealing with a testing problem where the probability of committing a type I
error is set a priori at 0.05. The question is: for which values of $T$ should
we reject $H_0$? The decision rule for rejecting $H_0$ should be such that the
corresponding probability of committing a type I error is 0.05. The value 0.05
is called the significance level.


Significance level. The significance level is the largest
acceptable probability of committing a type I error and is denoted by $\alpha$,
where $0 < \alpha < 1$.


We speak of "performing the test at level $\alpha$," as well as "rejecting $H_0$ in
favor of $H_1$ at level $\alpha$." In our example we are testing $H_0 : \mu = 120$ against
$H_1 : \mu > 120$ at level 0.05.

Quick exercise 26.1 Suppose that in the freeway example $H_0 : \mu = 120$ is
rejected in favor of $H_1 : \mu > 120$ at level $\alpha = 0.05$. Will it necessarily be
rejected at level $\alpha = 0.01$? On the other hand, suppose that $H_0 : \mu = 120$
is rejected in favor of $H_1 : \mu > 120$ at level $\alpha = 0.01$. Will it necessarily be
rejected at level $\alpha = 0.05$?

Let us continue with our example and determine for which values of $T = \bar{X}_3$
we should reject $H_0$ at level $\alpha = 0.05$ in favor of $H_1 : \mu > 120$. Suppose
we decide to fine each driver whose recorded average speed is 121 or more,
i.e., we reject $H_0$ whenever $T \ge 121$. Then how large is the probability of a
type I error $P(T \ge 121)$? When $H_0 : \mu = 120$ is true, then $T = \bar{X}_3$ has an
$N(120, 4/3)$ distribution, so that by the change-of-units rule for the normal
distribution (see page 106), the random variable

$$Z = \frac{T - 120}{2/\sqrt{3}}$$

has an $N(0, 1)$ distribution. This implies that

$$P(T \ge 121) = P\left(Z \ge \frac{121 - 120}{2/\sqrt{3}}\right) = P(Z \ge 0.87).$$


From Table B.1, we find $P(Z \ge 0.87) = 0.1922$, which means that the
probability of a type I error is greater than the significance level $\alpha = 0.05$. Since
this level was defined as the largest acceptable probability of a type I error,
we do not reject $H_0$. Similarly, if we decide to reject $H_0$ whenever we record
an average of 122 or more, the probability of a type I error equals 0.0416
(check this). This is smaller than $\alpha = 0.05$, so in that case we reject $H_0$. The
boundary case is the value $c$ that satisfies $P(T \ge c) = 0.05$. To find $c$, we must
solve


$$P\left(Z \ge \frac{c - 120}{2/\sqrt{3}}\right) = 0.05.$$

From Table B.2 we have that $z_{0.05} = t_{\infty, 0.05} = 1.645$, so that we find

$$\frac{c - 120}{2/\sqrt{3}} = 1.645,$$

which leads to

$$c = 120 + 1.645 \cdot \frac{2}{\sqrt{3}} = 121.9.$$

Hence, if we set the significance level $\alpha$ at 0.05, we should reject $H_0 : \mu = 120$
in favor of $H_1 : \mu > 120$ whenever $T \ge 121.9$. For our freeway example this
means that if the average recorded speed of a passing vehicle is greater than
or equal to 121.9, then the driver is fined for speeding. With this decision rule,
at most 5% of the drivers get fined unjustly.

In connection with p-values: the significance level is the level below which
the p-value is sufficiently small to reject $H_0$. Indeed, for any observed value
$t \ge 121.9$ we reject $H_0$, and the p-value for such a $t$ is at most 0.05:

$$P(T \ge t) \le P(T \ge 121.9) = 0.05.$$

We will see more about this relation in the next section.
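
The table lookups in the freeway example can be reproduced with the normal distribution from Python's standard library. The sketch below is illustrative only: the function names are hypothetical, and the computation assumes the model stated above, in which $T$ has an $N(120, 4/3)$ distribution under $H_0$. For the cutoff 121 the unrounded result is about 0.193, slightly different from the table value 0.1922, which comes from rounding $z$ to 0.87.

```python
from math import sqrt
from statistics import NormalDist

# Distribution of T (average of three measurements) when H0: mu = 120 holds.
T_under_H0 = NormalDist(mu=120, sigma=2 / sqrt(3))

def type_one_error(cutoff):
    """P(T >= cutoff) under H0 for the rule 'reject when T >= cutoff'."""
    return 1 - T_under_H0.cdf(cutoff)

def critical_value(alpha):
    """The cutoff c for which P(T >= c) = alpha under H0."""
    return T_under_H0.inv_cdf(1 - alpha)

print(round(type_one_error(122), 4))   # 0.0416
print(round(critical_value(0.05), 1))  # 121.9
```

Working with the distribution of $T$ directly avoids the intermediate standardization to $Z$; both routes give the same numbers up to rounding.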


26.2  Critical  region  and  critical  values 

In the freeway example the significance level 0.05 corresponds to the decision
rule "reject $H_0 : \mu = 120$ in favor of $H_1 : \mu > 120$ whenever $T \ge 121.9$." The
set $K = [121.9, \infty)$ consisting of values of the test statistic $T$ for which we
reject $H_0$ is called the critical region. The value 121.9, which is the boundary case
between rejecting and not rejecting $H_0$, is called the critical value.


Critical region and critical values. Suppose we test $H_0$
against $H_1$ at significance level $\alpha$ by means of a test statistic $T$.
The set $K \subset \mathbb{R}$ that corresponds to all values of $T$ for which we
reject $H_0$ in favor of $H_1$ is called the critical region. Values on the
boundary of the critical region are called critical values.


The precise shape of the critical region depends on both the chosen significance
level $\alpha$ and the test statistic $T$ that is used. But it will always be such that
the probability that $T \in K$ satisfies

$P(T \in K) \le \alpha$ in the case that $H_0$ is true.

At this point it becomes important to emphasize whether probabilities are
computed under the assumption that $H_0$ is true. With a slight abuse of
notation, we briefly write $P(T \in K \mid H_0)$ for the probability.

Relation  with  p-values 

If we record average speed $t = 124$, then this value falls in the critical region
$K = [121.9, \infty)$, so that $H_0: \mu = 120$ is rejected in favor of $H_1: \mu > 120$. On
the other hand, we can also compute the p-value corresponding to the observed
value 124. Since values of $T$ to the right provide stronger evidence against $H_0$,
the p-value is the following right tail probability:

$$P(T \ge 124 \mid H_0) = P\left(\frac{T - 120}{2/\sqrt{3}} \ge \frac{124 - 120}{2/\sqrt{3}}\right) = P(Z \ge 3.46) = 0.0003,$$

which is smaller than the significance level 0.05. This is no coincidence.

In general, suppose that we perform a test at level $\alpha$ using test statistic $T$
and that we have observed $t$ as the value of our test statistic. Then

$$t \in K \iff \text{the p-value corresponding to } t \text{ is less than or equal to } \alpha.$$

Figure 26.2 illustrates this for a testing problem where values of $T$ to the
right provide evidence against $H_0$ and in favor of $H_1$. In that case, the p-value
corresponds to the right tail probability $P(T \ge t \mid H_0)$. The shaded area to the
right of $c_\alpha$ corresponds to $\alpha = P(T \ge c_\alpha \mid H_0)$, whereas the more intensely
shaded area to the right of $t$ represents the p-value. We see that deciding
whether to reject $H_0$ at a given significance level $\alpha$ can be done by comparing
either $t$ with $c_\alpha$, or the p-value with $\alpha$. For this reason the p-value is sometimes
called the observed significance level.
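The two equivalent ways of deciding can be checked numerically for the freeway example. The following is a minimal sketch, assuming only that $T$ has an $N(120, 4/3)$ distribution under $H_0$, as in the text; the function names `critical_value` and `p_value` are illustrative and not part of the original.

```python
from math import sqrt
from statistics import NormalDist

MU_0 = 120.0           # speed under the null hypothesis
SD_T = 2 / sqrt(3)     # standard deviation of T under H0, since T ~ N(120, 4/3)
ALPHA = 0.05

null_dist = NormalDist(MU_0, SD_T)

def critical_value(alpha):
    """Value c with P(T >= c | H0) = alpha."""
    return null_dist.inv_cdf(1 - alpha)

def p_value(t):
    """Right tail probability P(T >= t | H0)."""
    return 1 - null_dist.cdf(t)

t_observed = 124.0
c = critical_value(ALPHA)          # about 121.9
p = p_value(t_observed)            # about 0.0003

# Both comparisons lead to the same decision.
reject_by_critical_value = t_observed >= c
reject_by_p_value = p <= ALPHA
print(round(c, 1), round(p, 4), reject_by_critical_value, reject_by_p_value)
```

Changing `t_observed` to a value below the critical value, say 121.0, makes both comparisons come out as "do not reject".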


The concepts of critical value and p-value have their own merit. The critical
region and the corresponding critical values specify exactly what values of $T$
lead to rejection of $H_0$ at a given level $\alpha$. This can be done even without
obtaining a dataset and computing the value $t$ of the test statistic. The
p-value, on the other hand, represents the strength of the evidence the observed
value $t$ bears against $H_0$. But it does not specify all values of $T$ that lead to
rejection of $H_0$ at a given level $\alpha$.

Quick exercise 26.2 In our freeway example, we have already computed
the relevant tail probability to decide whether a person with recorded average
speed $t = 124$ gets fined if we set the significance level at 0.05. Suppose the
significance level is set at $\alpha = 0.01$ (we allow 1% of the drivers to get fined
unjustly). Determine whether a person with recorded average speed $t = 124$
gets fined ($H_0: \mu = 120$ is rejected). Furthermore, determine the critical
region in this case.

Sometimes the critical region $K$ can be constructed such that $P(T \in K \mid H_0)$ is
exactly equal to $\alpha$, as in the freeway example. However, when the distribution
of $T$ is discrete, this is not always possible. This is illustrated by the next
example.

After the introduction of the Euro, Polish mathematicians claimed that the
Belgian 1 Euro coin is not a fair coin (see, for instance, the New Scientist,
January 4, 2002). Suppose we put a 1 Euro coin to the test. We will throw
it ten times and record $X$, the number of heads. Then $X$ has a $Bin(10, p)$
distribution, where $p$ denotes the probability of heads. We would like to find out
whether $p$ differs from 1/2. Therefore we test

$$H_0: p = \tfrac{1}{2} \quad \text{against} \quad H_1: p \ne \tfrac{1}{2} \text{ (the coin is not fair).}$$

We use $X$ as the test statistic. When we set the significance level $\alpha$ at 0.05,
for what values of $X$ will we reject $H_0$ and conclude that the coin is not fair?
Let us first find out what values of $X$ are in favor of $H_1$. If $H_0: p = 1/2$ is
true, then $E[X] = 10 \cdot \tfrac{1}{2} = 5$, so that values of $X$ close to 5 are in favor of $H_0$.
Values close to 10 suggest that $p > 1/2$ and values close to 0 suggest that
$p < 1/2$. Hence, both values close to 0 and values close to 10 are in favor of
$H_1: p \ne 1/2$.


[Figure: number line for the values of $X$ from 0 to 10, with 5 in the middle; the values near 0 and the values near 10 are marked as values in favor of $H_1$.]


This means that we will reject $H_0$ in favor of $H_1$ whenever $X \le c_l$ or $X \ge c_u$.
Therefore, the critical region is the set

$$K = \{0, 1, \ldots, c_l\} \cup \{c_u, \ldots, 9, 10\}.$$

The boundary values $c_l$ and $c_u$ are called left and right critical values. They
must be chosen such that the critical region $K$ is as large as possible and still
satisfies

$$P(X \in K \mid H_0) = P(X \le c_l \mid p = \tfrac{1}{2}) + P(X \ge c_u \mid p = \tfrac{1}{2}) \le 0.05.$$

Here $P(X \ge c_u \mid p = \tfrac{1}{2})$ denotes the probability $P(X \ge c_u)$ computed with $X$
having a $Bin(10, \tfrac{1}{2})$ distribution. Since we have no preference for rejecting $H_0$
for values close to 0 or close to 10, we divide 0.05 over the two sides, and we
choose $c_l$ as large as possible and $c_u$ as small as possible such that

$$P(X \le c_l \mid p = \tfrac{1}{2}) \le 0.025 \quad \text{and} \quad P(X \ge c_u \mid p = \tfrac{1}{2}) \le 0.025.$$

Table 26.1. Left tail probabilities of the $Bin(10, \tfrac{1}{2})$ distribution.

 $k$   $P(X \le k)$      $k$   $P(X \le k)$
  0    0.00098            6    0.82813
  1    0.01074            7    0.94531
  2    0.05469            8    0.98926
  3    0.17188            9    0.99902
  4    0.37696           10    1.00000
  5    0.62305

The left tail probabilities of the $Bin(10, \tfrac{1}{2})$ distribution are listed in
Table 26.1. We immediately see that $c_l = 1$ is the largest value such that
$P(X \le c_l \mid p = 1/2) \le 0.025$. Similarly, $c_u = 9$ is the smallest value such that
$P(X \ge c_u \mid p = 1/2) \le 0.025$. Indeed, when $X$ has a $Bin(10, \tfrac{1}{2})$ distribution,

$$P(X \ge 9) = 1 - P(X \le 8) = 1 - 0.98926 = 0.01074,$$

$$P(X \ge 8) = 1 - P(X \le 7) = 1 - 0.94531 = 0.05469.$$

Hence, if we test $H_0: p = 1/2$ against $H_1: p \ne 1/2$ at level $\alpha = 0.05$, the
critical region is the set $K = \{0, 1, 9, 10\}$. The corresponding probability of a type I error is

$$P(X \in K) = P(X \le 1) + P(X \ge 9) = 0.01074 + 0.01074 = 0.02148,$$

which is smaller than the significance level. You may perform ten throws with
your favorite coin and see whether the number of heads falls in the critical
region.
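The search for the left and right critical values can also be carried out by a short program. The sketch below assumes the equal split of $\alpha$ over the two tails described above; the helper names `binom_cdf` and `two_sided_critical_region` are illustrative, not from the text.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p); an empty sum gives 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def two_sided_critical_region(n, p0, alpha):
    """Critical region with at most alpha/2 in each tail, plus its type I error probability.

    Assumes each tail contains at least one value that qualifies; otherwise
    max()/min() raise ValueError.
    """
    half = alpha / 2
    # Largest c_l with P(X <= c_l) <= alpha/2.
    c_l = max(k for k in range(n + 1) if binom_cdf(k, n, p0) <= half)
    # Smallest c_u with P(X >= c_u) = 1 - P(X <= c_u - 1) <= alpha/2.
    c_u = min(k for k in range(n + 1) if 1 - binom_cdf(k - 1, n, p0) <= half)
    region = list(range(0, c_l + 1)) + list(range(c_u, n + 1))
    type_one = binom_cdf(c_l, n, p0) + (1 - binom_cdf(c_u - 1, n, p0))
    return region, type_one

region, type_one = two_sided_critical_region(10, 0.5, 0.05)
print(region, round(type_one, 5))   # prints [0, 1, 9, 10] 0.02148
```

Because the distribution is discrete, the attained type I error probability 0.02148 stays well below the nominal level 0.05.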

Quick exercise 26.3 Recall the tank example where we tested $H_0: N = 350$
against $H_1: N < 350$ by means of the test statistic $T = \max X_i$. Suppose that
we perform the test at level 0.05. Deduce the critical region $K$ corresponding
to level 0.05 from the left tail probabilities given here:

 $k$                      195     194     193     192     191
 $P(T \le k \mid H_0)$    0.0525  0.0511  0.0498  0.0485  0.0472

Is $P(T \in K \mid H_0) = 0.05$?

One- and two-tailed p-values

In the Euro coin example, we deviate from $H_0: p = 1/2$ in two directions:
values of $X$ both far to the right and far to the left of 5 are evidence against $H_0$.
Suppose that in ten throws with the 1 Euro coin we recorded $x$ heads. What
would the p-value be corresponding to $x$? The problem is that the direction
in which values of $X$ are at least as extreme as the observed value $x$ depends
on whether $x$ lies to the right or to the left of 5.

At this point there are two natural solutions. One may report the appropriate
left or right tail probability, which corresponds to the direction in which
$x$ deviates from $H_0$. For instance, if $x$ lies to the right of 5, we compute
$P(X \ge x \mid H_0)$. This is called a one-tailed p-value. The disadvantage of
one-tailed p-values is that they are somewhat misleading about the strength of the
evidence that the observed value $x$ bears against $H_0$. In view of the relation
between rejection on the basis of critical values or on the basis of a p-value,
the one-tailed p-value should be compared to $\alpha/2$. On the other hand, since
people are inclined to compare p-values with the significance level $\alpha$ itself,
one could also double the one-tailed p-value and compare this with $\alpha$. This
double-tail probability is called a two-tailed p-value. It doesn't make much
of a difference, as long as one also reports whether the reported p-value is
one-tailed or two-tailed.

Let us illustrate things by means of the findings by the Polish mathematicians.
They performed 250 throws with a Belgian 1 Euro coin and recorded heads
140 times (see also Exercise 24.2). The question is whether this provides strong
enough evidence against $H_0: p = 1/2$. The observed value 140 is to the right
of 125, the value we would expect if $H_0$ is true. Hence the one-tailed p-value
is $P(X \ge 140)$, where now $X$ has a $Bin(250, \tfrac{1}{2})$ distribution. By means of the
normal approximation (see page 201), we find

$$P(X \ge 140) = P\left(\frac{X - 125}{\tfrac{1}{2}\sqrt{250}} \ge \frac{140 - 125}{\tfrac{1}{2}\sqrt{250}}\right) \approx P(Z \ge 1.90) = 1 - \Phi(1.90) = 0.0287.$$

Therefore the two-tailed p-value is approximately 0.0574, which does not provide
very strong evidence against $H_0$. In fact, the exact two-tailed p-value,
computed by means of statistical software, is 0.066, which is even larger.
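Both p-values can be recomputed with a few lines of code. This is a sketch under the assumptions of the text (250 throws, 140 heads, $H_0: p = 1/2$, the two-tailed value taken as twice the one-tailed value); the variable names are illustrative. The normal approximation here uses the unrounded standardized value, so it differs slightly from the figure obtained with $z = 1.90$.

```python
from math import comb, sqrt
from statistics import NormalDist

N_THROWS, HEADS, P0 = 250, 140, 0.5

# Normal approximation: under H0, X is approximately N(n*p0, n*p0*(1 - p0)).
mean = N_THROWS * P0
sd = sqrt(N_THROWS * P0 * (1 - P0))
z = (HEADS - mean) / sd                       # about 1.90
one_tailed_normal = 1 - NormalDist().cdf(z)
two_tailed_normal = 2 * one_tailed_normal

# Exact right tail of Bin(250, 1/2): count outcomes with integers, divide once.
tail_count = sum(comb(N_THROWS, k) for k in range(HEADS, N_THROWS + 1))
one_tailed_exact = tail_count / 2**N_THROWS
two_tailed_exact = 2 * one_tailed_exact

print(round(z, 2), round(two_tailed_normal, 3), round(two_tailed_exact, 3))
```

Summing the binomial coefficients as integers avoids rounding errors that could accumulate when adding many tiny floating-point probabilities.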

Quick exercise 26.4 In a Dutch newspaper (De Telegraaf, January 3, 2002)
it was reported that the Polish mathematicians recorded heads 150 times.
What are the one- and two-tailed p-values in this case? Do they now have
a case?


26.3  Type  II  error 

As we have just seen, by setting a significance level $\alpha$, we are able to control
the probability of committing a type I error; it will at most be $\alpha$. For instance,
let us return to the freeway example and suppose that we adopt the decision
rule to fine the driver for speeding if her average observed speed is at least
121.9, i.e.,

reject $H_0: \mu = 120$ in favor of $H_1: \mu > 120$ whenever $T = \bar{X}_3 \ge 121.9$.

From Section 26.1 we know that with this decision rule, the probability of a
type I error is 0.05. What is the probability of committing a type II error?
This corresponds to the percentage of drivers whose true speed is above 120
but who do not get fined because their recorded average speed is below 121.9.
For instance, suppose that a car passes at true speed $\mu = 125$. A type II error
occurs when $T < 121.9$, and since $T = \bar{X}_3$ has an $N(125, 4/3)$ distribution,
the probability that this happens is

$$P(T < 121.9 \mid \mu = 125) = P\left(\frac{T - 125}{2/\sqrt{3}} < \frac{121.9 - 125}{2/\sqrt{3}}\right) = \Phi(-2.68) = 0.0036.$$


This looks promising, but now consider a vehicle passing at true speed $\mu = 123$.
The probability of committing a type II error in this case is

$$P(T < 121.9 \mid \mu = 123) = P\left(\frac{T - 123}{2/\sqrt{3}} < \frac{121.9 - 123}{2/\sqrt{3}}\right) = \Phi(-0.95) = 0.1711.$$


Hence 17.11% of all drivers that pass at speed $\mu = 123$ will not get fined. In
Figure 26.3 the last situation is illustrated. The curve on the left represents the
probability density of the $N(120, 4/3)$ distribution, which is the distribution
of $T$ under the null hypothesis. The shaded area on the right of 121.9 represents
the probability of committing a type I error

$$P(T \ge 121.9 \mid \mu = 120) = 0.05.$$

The curve on the right is the probability density of the $N(123, 4/3)$ distribution,
which is the distribution of $T$ under the alternative $\mu = 123$. The shaded
area on the left of 121.9 represents the probability of a type II error

$$P(T < 121.9 \mid \mu = 123) = 0.1711.$$

Fig. 26.3. Type I and type II errors in the freeway example. (To the left of 121.9: do not reject $H_0$; to the right of 121.9: reject $H_0$.)

Shifting $\mu$ further to the right will result in a smaller probability of a type II
error. However, shifting $\mu$ toward the value 120 leads to a larger probability
of a type II error. In fact it can be arbitrarily close to 0.95.
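How the probability of a type II error depends on the true speed can be tabulated directly. The sketch below assumes the decision rule "reject when $T \ge 121.9$" and $T \sim N(\mu, 4/3)$ from the text; the function name `type_two_error` is illustrative. It uses unrounded standardized values, so the last digits can differ slightly from the table-based figures above.

```python
from math import sqrt
from statistics import NormalDist

CRITICAL_VALUE = 121.9
SD_T = 2 / sqrt(3)      # standard deviation of T, whatever the true mu

def type_two_error(mu):
    """P(T < 121.9) when T ~ N(mu, 4/3), for a true speed mu > 120."""
    return NormalDist(mu, SD_T).cdf(CRITICAL_VALUE)

# The error probability shrinks as mu moves away from 120
# and approaches 0.95 as mu moves toward 120.
for mu in (125, 123, 121, 120.1, 120.0001):
    print(mu, round(type_two_error(mu), 4))
```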

The previous example illustrates that the probability of committing a type II
error depends on the actual value of $\mu$ in the alternative hypothesis $H_1: \mu > 120$.
The closer $\mu$ is to 120, the higher the probability of a type II error will
be. In contrast with the probability of a type I error, which is always at most
$\alpha$, the probability of a type II error may be arbitrarily close to $1 - \alpha$. This is
illustrated in the next quick exercise.

Quick exercise 26.5 What is the probability of a type II error in the freeway
example if $\mu = 120.1$?


26.4  Relation  with  confidence  intervals 


When testing $H_0: \mu = 120$ against $H_1: \mu > 120$ at level 0.05 in the freeway
example, the critical value was obtained by the formula

$$c_{0.05} = 120 + 1.645 \cdot \frac{2}{\sqrt{3}}.$$

On the other hand, using that $\bar{X}_3$ has an $N(\mu, 4/3)$ distribution, a 95% lower
confidence bound for $\mu$ in this case can be derived from

$$l_n = \bar{x}_3 - 1.645 \cdot \frac{2}{\sqrt{3}}.$$

Although, at first sight, testing hypotheses and constructing confidence intervals
seem to be two separate statistical procedures, they are in fact intimately
related. In the freeway example, observe that for a given dataset $x_1, x_2, x_3$,

we reject $H_0: \mu = 120$ in favor of $H_1: \mu > 120$ at level 0.05

$$\iff \bar{x}_3 \ge 120 + 1.645 \cdot \frac{2}{\sqrt{3}}$$

$$\iff \bar{x}_3 - 1.645 \cdot \frac{2}{\sqrt{3}} \ge 120$$

$$\iff 120 \text{ is not in the 95\% one-sided confidence interval for } \mu.$$


This is not a coincidence. In general, the following applies. Suppose that for
some parameter $\theta$ we test $H_0: \theta = \theta_0$. Then

we reject $H_0: \theta = \theta_0$ in favor of $H_1: \theta > \theta_0$ at level $\alpha$
if and only if
$\theta_0$ is not in the $100(1 - \alpha)\%$ one-sided confidence interval for $\theta$.

The same relation holds for testing against $H_1: \theta < \theta_0$, and a similar relation
holds between testing against $H_1: \theta \ne \theta_0$ and two-sided confidence intervals:

we reject $H_0: \theta = \theta_0$ in favor of $H_1: \theta \ne \theta_0$ at level $\alpha$
if and only if
$\theta_0$ is not in the $100(1 - \alpha)\%$ two-sided confidence region for $\theta$.

In fact, one could use these facts to define the $100(1 - \alpha)\%$ confidence region for
a parameter $\theta$ as the set of values $\theta_0$ for which the null hypothesis $H_0: \theta = \theta_0$
is not rejected at level $\alpha$.
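The idea of obtaining a confidence region as the set of values that are not rejected can be tried out on the freeway example. The sketch below assumes the one-sided test at level 0.05 based on $\bar{X}_3 \sim N(\mu, 4/3)$; the grid of candidate values and the function names are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05
# Half-width 1.645 * 2/sqrt(3), shared by the test and the confidence bound.
MARGIN = NormalDist().inv_cdf(1 - ALPHA) * 2 / sqrt(3)

def rejects(x_bar, mu_0):
    """Level-0.05 test of H0: mu = mu_0 against H1: mu > mu_0."""
    return x_bar >= mu_0 + MARGIN

def lower_bound(x_bar):
    """95% lower confidence bound for mu."""
    return x_bar - MARGIN

def not_rejected(x_bar, candidates):
    """Candidate values mu_0 that the test does not reject."""
    return [mu_0 for mu_0 in candidates if not rejects(x_bar, mu_0)]

x_bar = 124.0
grid = [118 + 0.1 * i for i in range(81)]    # hypothetical grid 118.0, ..., 126.0
kept = not_rejected(x_bar, grid)

# Every value that survives the test lies above the lower confidence bound.
print(round(lower_bound(x_bar), 2), round(min(kept), 1))
```

For an observed average of 124, the value 120 is rejected and, correspondingly, lies below the lower confidence bound.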

It should be emphasized that these relations only hold if the random variable
that is used to construct the confidence interval relates appropriately to the
test statistic. For instance, the preceding relations do not hold if on the one
hand, we construct a confidence interval for the parameter $\mu$ of an $N(\mu, \sigma^2)$
distribution by means of the studentized mean $(\bar{X}_n - \mu)/(S_n/\sqrt{n})$, and on the
other hand, use the sample median $\mathrm{Med}_n$ to test a null hypothesis for $\mu$.


26.5  Solutions  to  the  quick  exercises 

26.1 In the first situation, we reject at significance level $\alpha = 0.05$, which
means that the probability of committing a type I error is at most 0.05. This
does not necessarily mean that this probability will also be less than or equal to
0.01. Therefore with this information we cannot know whether we also reject
at level $\alpha = 0.01$. In the reversed situation, if we reject at level $\alpha = 0.01$, then
the probability of committing a type I error is at most 0.01, and is therefore
also smaller than 0.05. This means that we also reject at level $\alpha = 0.05$.

26.2 To decide whether we should reject $H_0: \mu = 120$ at level 0.01, we could
compute $P(T \ge 124 \mid H_0)$ and compare this with 0.01. We have already seen
that $P(T \ge 124 \mid H_0) = 0.0003$. This is (much) smaller than the significance
level $\alpha = 0.01$, so we should reject.

The critical region is $K = [c, \infty)$, where we must solve $c$ from

$$P(T \ge c \mid H_0) = P\left(Z \ge \frac{c - 120}{2/\sqrt{3}}\right) = 0.01.$$

Since $z_{0.01} = 2.326$, this means that $c = 120 + 2.326 \cdot (2/\sqrt{3}) = 122.7$.

26.3 The critical region is of the form $K = \{5, 6, \ldots, c\}$, where the critical
value $c$ is the largest value for which $P(T \le c \mid H_0)$ is still less than or
equal to 0.05. From the table we immediately see that $c = 193$ and that
$P(T \in K \mid H_0) = P(T \le 193 \mid H_0) = 0.0498$, which is not equal to 0.05.

26.4 By means of the normal approximation, for the one-tailed p-value we
find

$$P(X \ge 150) = P\left(\frac{X - 125}{\tfrac{1}{2}\sqrt{250}} \ge \frac{150 - 125}{\tfrac{1}{2}\sqrt{250}}\right) \approx P(Z \ge 3.16) = 1 - \Phi(3.16) = 0.0008.$$

The two-tailed p-value is 0.0016. This is a lot smaller than the two-tailed
p-value 0.0574, corresponding to 140 heads. It seems that with 150 heads the
mathematicians would have a case; the Belgian Euro coin would then appear
not to be fair.


26.5 The probability of a type II error is

$$P(T < 121.9 \mid \mu = 120.1) = P\left(\frac{T - 120.1}{2/\sqrt{3}} < \frac{121.9 - 120.1}{2/\sqrt{3}}\right) = \Phi(1.56) = 0.9406.$$


26.6  Exercises 

26.1 Polygraphs that are used in criminal investigations are supposed to
indicate whether a person is lying or telling the truth. However, the procedure
is not infallible, as is illustrated by the following example. An experienced
polygraph examiner was asked to make an overall judgment for each of a
total of 280 records, of which 140 were from guilty suspects and 140 from
innocent suspects. The results are listed in Table 26.2. We view each judgment
as a problem of hypothesis testing, with the null hypothesis corresponding to
"suspect is innocent" and the alternative hypothesis to "suspect is guilty."
Estimate the probabilities of a type I error and a type II error that apply to
this polygraph method on the basis of Table 26.2.

26.2 Consider the testing problem in Exercise 25.11. Compute the probability
of committing a type II error if the true value of $\mu$ is 1.

26.3 □ One generates a number $x$ from a uniform distribution on the interval
$[0, \theta]$. One decides to test $H_0: \theta = 2$ against $H_1: \theta \ne 2$ by rejecting $H_0$ if
$x \le 0.1$ or $x \ge 1.9$.

a. Compute the probability of committing a type I error.

b. Compute the probability of committing a type II error if the true value
of $\theta$ is 2.5.

Table 26.2. Examiners and suspects.

                              Suspect's true status
 Examiner's assessment        Innocent      Guilty
 Acquitted                      131           15
 Convicted                        9          125

Source: F.S. Horvath and J.E. Reid. The reliability of polygraph examiner
diagnosis of truth and deception. Journal of Criminal Law, Criminology,
and Police Science, 62(2):276-281, 1971.

26.4 To investigate the hypothesis that a horse's chances of winning an
eight-horse race on a circular track are affected by its position in the starting lineup,
the starting position of each of 144 winners was recorded ([30]). It turned out
that 29 of these winners had starting position one (closest to the rail on the
inside track). We model the number of winners with starting position one by
a random variable $T$ with a $Bin(144, p)$ distribution. We test the hypothesis
$H_0: p = 1/8$ against $H_1: p > 1/8$ at level $\alpha = 0.01$ with $T$ as test statistic.

a. Argue whether the test procedure involves a right critical value, a left
critical value, or both.

b. Use the normal approximation to compute the critical value(s) corresponding
to $\alpha = 0.01$, determine the critical region, and report your conclusion
about the null hypothesis.

26.5 ⊞ Recall Exercises 23.5 and 24.8 about the 1500 m speed-skating results
in the 2002 Winter Olympic Games. The number of races won by skaters
starting in the outer lane is modeled by a random variable $X$ with a $Bin(23, p)$
distribution. The question of whether there is an outer lane advantage was
investigated in Exercise 24.8 by means of constructing confidence intervals
using the normal approximation. In this exercise we examine this question by
testing the null hypothesis $H_0: p = 1/2$ against $H_1: p > 1/2$ using $X$ as the
test statistic. The distribution of $X$ under $H_0$ is given in Table 26.3. Out of
23 completed races, 15 were won by skaters starting in the outer lane.

a. Compute the p-value corresponding to $x = 15$ and report your conclusion
if we perform the test at level 0.05. Does your conclusion agree with the
confidence interval you found for $p$ in Exercise 24.8 b?

b. Determine the critical region corresponding to significance level $\alpha = 0.05$.

c. Compute the probability of committing a type I error if we base our
decision rule on the critical region determined in b.

d. Use the normal approximation to determine the probability of committing
a type II error for the case $p = 0.6$, if we base our decision rule on the
critical region determined in b.

Table 26.3. Left tail probabilities for the $Bin(23, \tfrac{1}{2})$ distribution.

 $k$   $P(X \le k)$     $k$   $P(X \le k)$     $k$   $P(X \le k)$
  0    0.0000            8    0.1050           16    0.9827
  1    0.0000            9    0.2024           17    0.9947
  2    0.0000           10    0.3388           18    0.9987
  3    0.0002           11    0.5000           19    0.9998
  4    0.0013           12    0.6612           20    1.0000
  5    0.0053           13    0.7976           21    1.0000
  6    0.0173           14    0.8950           22    1.0000
  7    0.0466           15    0.9534           23    1.0000

26.6 □ Consider Exercises 25.2 and 25.7. One decides to test $H_0: \mu = 1472$
against $H_1: \mu > 1472$ at level $\alpha = 0.05$ on the basis of the recorded value
1718 of the test statistic $T$.

a.  Argue  whether  the  test  procedure  involves  a  right  critical  value,  a  left 
critical  value,  or  both. 

b.  Use  the  fact  that  the  distribution  of  T  can  be  approximated  by  an  p) 
distribution  to  determine  the  critical  value(s)  and  the  critical  region,  and 
report  your  conclusion  about  the  null  hypothesis. 

26.7 A random sample $X_1, X_2$ is drawn from a uniform distribution on the
interval $[0, \theta]$. We wish to test $H_0: \theta = 1$ against $H_1: \theta < 1$ by rejecting if
$X_1 + X_2 \le c$. Find the value of $c$ and the critical region that correspond to a
level of significance 0.05.

Hint: use Exercise 11.5.

26.8 ⊞ This exercise is meant to illustrate that the shape of the critical region
is not necessarily similar to the type of alternative hypothesis. The type of
alternative hypothesis and the test statistic used determine the shape of the
critical region.

Suppose that $X_1, X_2, \ldots, X_n$ form a random sample from an $Exp(\lambda)$
distribution, and we test $H_0: \lambda = 1$ with test statistics $T = \bar{X}_n$ and $T' = e^{-\bar{X}_n}$.

a. Suppose we test the null hypothesis against $H_1: \lambda > 1$. Determine for
both test procedures whether they involve a right critical value, a left
critical value, or both.

b. Same question as in part a, but now test against $H_1: \lambda \ne 1$.

26.9 ⊞ Similar to Exercise 26.8, but with a random sample $X_1, X_2, \ldots, X_n$
from an $N(\mu, 1)$ distribution. We test $H_0: \mu = 0$ with test statistics $T = (\bar{X}_n)^2$
and $T' = 1/\bar{X}_n$.

a. Suppose that we test the null hypothesis against $H_1: \mu \ne 0$. Determine
the shape of the critical region for both test procedures.

b. Same question as in part a, but now test against $H_1: \mu > 0$.

27 The t-test


In many applications the quantity of interest can be represented by the
expectation of the model distribution. In some of these applications one wants
to  know  whether  this  expectation  deviates  from  some  a  priori  specified  value. 
This  can  be  investigated  by  means  of  a  statistical  test,  known  as  the  t-test. 
We  consider  this  test  both  under  the  assumption  that  the  model  distribution 
is  normal  and  without  the  assumption  of  normality.  Furthermore,  we  discuss  a 
similar  test  for  the  slope  and  the  intercept  in  a  simple  linear  regression  model. 


27.1  Monitoring  the  production  of  ball  bearings 

The production lines in a large industrial corporation are set to produce a
specific type of steel ball bearing with a diameter of 1 millimeter. In order to
check the performance of the production lines, a number of ball bearings are
picked at the end of the day and their diameters are measured. Suppose we
observe 20 diameters of ball bearings from the production lines, which are listed
in Table 27.1. The average diameter is $\bar{x}_{20} = 1.03$ millimeter. This clearly
deviates from the target value 1, but the question is whether the difference
can be attributed to chance or whether it is large enough to conclude that
the production lines are producing ball bearings with a wrong diameter. To
answer this question, we model the dataset as a realization of a random sample
$X_1, X_2, \ldots, X_{20}$ from a probability distribution with expected value $\mu$. The
parameter $\mu$ represents the diameter of ball bearings produced by the production
lines. In order to investigate whether this diameter deviates from 1, we
test the null hypothesis $H_0: \mu = 1$ against $H_1: \mu \ne 1$.

Table 27.1. Diameters of ball bearings.

 1.018  1.009  1.042  1.053  0.969
 1.002  0.988  1.019  1.062  1.032
 1.072  0.977  1.062  1.044  1.069
 1.029  0.979  1.096  1.079  0.999

This example illustrates a situation that often occurs: the data $x_1, x_2, \ldots, x_n$
are a realization of a random sample $X_1, X_2, \ldots, X_n$ from a distribution with
expectation $\mu$, and we want to test whether $\mu$ equals an a priori specified value,
say $\mu_0$. According to the law of large numbers, $\bar{X}_n$ is close to $\mu$ for large $n$.
This suggests a test statistic based on $\bar{X}_n - \mu_0$; realizations of $\bar{X}_n - \mu_0$ close
to zero are in favor of the null hypothesis. Does $\bar{X}_n - \mu_0$ suffice as a test
statistic?

In our example, $\bar{x}_n - \mu_0 = 1.03 - 1 = 0.03$. Should we interpret this as small?
First, note that under the null hypothesis $E[\bar{X}_n - \mu_0] = \mu - \mu_0 = 0$. Now, if
$\bar{X}_n - \mu_0$ had standard deviation 1, then the value 0.03 would be within one
standard deviation of $E[\bar{X}_n - \mu_0]$. The "$\mu \pm$ a few $\sigma$" rule on page 185 then
suggests that the value 0.03 is not exceptional; it must be seen as a small
deviation. On the other hand, if $\bar{X}_n - \mu_0$ has standard deviation 0.001, then
the value 0.03 is 30 standard deviations away from $E[\bar{X}_n - \mu_0]$. According to
the "$\mu \pm$ a few $\sigma$" rule this is very exceptional; the value 0.03 must be seen
as a large deviation. The next quick exercise provides a concrete example.
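The effect of the standard deviation on how exceptional the value 0.03 is can be made concrete with a small computation. The sketch below additionally assumes that $\bar{X}_n - \mu_0$ is normally distributed with mean 0, which the paragraph above does not require; the function name `right_tail` is illustrative.

```python
from math import erfc, sqrt

def right_tail(deviation, sd):
    """P(Y >= deviation) for Y ~ N(0, sd^2), via the complementary error function."""
    return 0.5 * erfc(deviation / (sd * sqrt(2)))

# Same observed deviation 0.03, two different standard deviations.
for sd in (1.0, 0.001):
    standard_units = 0.03 / sd          # deviation measured in standard deviations
    print(sd, standard_units, right_tail(0.03, sd))
```

Using `erfc` rather than one minus a distribution function keeps the extremely small tail probability for 30 standard deviations from being rounded to exactly zero.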

Quick exercise 27.1 Suppose that $\bar{X}_n$ is a normal random variable with
expectation 1 and variance 1. Determine $P(\bar{X}_n - 1 \ge 0.03)$. Find the same
probability, but for the case where the variance is $(0.01)^2$.

This discussion illustrates that we must standardize $\bar{X}_n - \mu_0$ to incorporate
its variation. Recall that

$$\mathrm{Var}(\bar{X}_n - \mu_0) = \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n},$$

where $\sigma^2$ is the variance of each $X_i$. Hence, standardizing $\bar{X}_n - \mu_0$ means
that we should divide by $\sigma/\sqrt{n}$. Since $\sigma$ is unknown, we substitute the sample
standard deviation $S_n$ for $\sigma$. This leads to the following test statistic for the
null hypothesis $H_0: \mu = \mu_0$:

$$T = \frac{\bar{X}_n - \mu_0}{S_n/\sqrt{n}}.$$

Values of $T$ close to zero are in favor of $H_0: \mu = \mu_0$. Large positive values of
$T$ suggest that $\mu > \mu_0$ and large negative values suggest that $\mu < \mu_0$; both
are evidence against $H_0$.

For the ball bearing data one finds that $s_n = 0.0372$, so that

$$t = \frac{\bar{x}_n - \mu_0}{s_n/\sqrt{n}} = \frac{1.03 - 1}{0.0372/\sqrt{20}} = 3.607.$$

This is clearly different from zero, but the question is whether this difference
is large enough to reject $H_0: \mu = 1$. To answer this question, we need to know
the probability distribution of $T$ under the null hypothesis. Note that under
the null hypothesis $H_0: \mu = \mu_0$, the test statistic

$$T = \frac{\bar{X}_n - \mu_0}{S_n/\sqrt{n}}$$

is the studentized mean (see also Chapter 23)

$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}.$$

Hence, under the null hypothesis, the probability distribution of $T$ is the same
as that of the studentized mean.


27.2  The  one-sample  t-test 

The classical assumption is that the dataset is a realization of a random sample
from an $N(\mu, \sigma^2)$ distribution. In that case our test statistic $T$ turns out to
have a t-distribution under the null hypothesis, as we will see later. For this
reason, the test for the null hypothesis $H_0: \mu = \mu_0$ is called the (one-sample)
t-test. Without the assumption of normality, we will use the bootstrap to
approximate the distribution of $T$. For large sample sizes, this distribution
can be approximated by means of the central limit theorem. We start with
the first case.

Normal  data 

Suppose that the dataset $x_1, x_2, \ldots, x_n$ is a realization of a random sample
$X_1, X_2, \ldots, X_n$ from an $N(\mu, \sigma^2)$ distribution. Then, according to the rule on
page 349, the studentized mean has a $t(n-1)$ distribution. An immediate
consequence is that, under the null hypothesis $H_0: \mu = \mu_0$, our test
statistic $T$ also has a $t(n-1)$ distribution. Therefore, if we test $H_0: \mu = \mu_0$
against $H_1: \mu \ne \mu_0$ at level $\alpha$, then we must reject the null hypothesis in
favor of $H_1: \mu \ne \mu_0$ if

$$T \le -t_{n-1,\alpha/2} \quad \text{or} \quad T \ge t_{n-1,\alpha/2}.$$

Similar decision rules apply to alternatives $H_1: \mu > \mu_0$ and $H_1: \mu < \mu_0$.
Suppose that in the ball bearing example we test $H_0: \mu = 1$ against
$H_1: \mu \ne 1$ at level $\alpha = 0.05$. From Table B.2 we find $t_{19,0.025} = 2.093$. Hence, we
must reject if $T \le -2.093$ or $T \ge 2.093$. For the ball bearing data we found
$t = 3.607$, which means we reject the null hypothesis at level $\alpha = 0.05$.
Alternatively,  one  might  report  the  one-tailed  p-value  corresponding  to  the 
observed  value  t  and  compare  this  with  a/2.  The  one-tailed  p- value  is  ei¬ 
ther  a  right  or  a  left  tail  probability,  which  must  be  computed  by  means 
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of  the  t{n  —  1)  distribution.  In  our  ball  bearing  example  the  one-tailed  p- 
value  is  the  right  tail  probability  P(T  >  3.607).  From  Table  B.2  we  see 
that  this  probability  is  between  0.0005  and  0.0010,  which  is  smaller  than 
a/2  =  0.025  (to  be  precise,  by  means  of  a  statistical  software  package  we 
found  P(r  >  3.607)  =  0.00094).  The  data  provide  strong  enough  evidence 
against  the  null  hypothesis,  so  that  it  seems  sensible  to  adjust  the  settings  of 
the  production  line. 
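The decision rule for the ball bearing example can be retraced with a few lines of Python. This is only a sketch: the function name `one_sample_t` is made up for illustration, the summary values are the ones quoted in the text, and the critical value 2.093 is typed in from Table B.2 rather than computed.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """Value of the one-sample t statistic (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / math.sqrt(n))

# Summary values quoted in the text for the ball bearing data (n = 20).
t = one_sample_t(xbar=1.03, mu0=1.0, s=0.0372, n=20)

# Critical value t_{19, 0.025}, copied from Table B.2 (not computed here).
t_crit = 2.093

# Two-sided rule: reject if t <= -t_crit or t >= t_crit.
reject = (t <= -t_crit) or (t >= t_crit)
print(f"t = {t:.3f}, reject H0 at level 0.05: {reject}")
```

The tail probability itself requires the t-distribution function, which the Python standard library does not provide; a statistical package would be needed for that step.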

Quick exercise 27.2 Suppose that the data in Table 27.1 are from two
separate production lines. The first ten measurements have average 1.0194 and
standard deviation 0.0290, whereas the last ten measurements have average
1.0406 and standard deviation 0.0428. Perform the t-test $H_0: \mu = 1$ against
$H_1: \mu \neq 1$ at level $\alpha = 0.01$ for both datasets separately, assuming normality.

Nonnormal  data 

Draw  a  rectangle  with  height  h  and  width  w  (let  us  agree  that  w  >  h),  and 
within  this  rectangle  draw  a  square  with  sides  of  length  h  (see  Figure  27.1). 
This  creates  another  (smaller)  rectangle  with  horizontal  and  vertical  sides  of 


Fig.  27.1.  Rectangle  with  square  within. 


lengths $w - h$ and $h$. A large rectangle with a vertical-to-horizontal ratio that
is equal to the horizontal-to-vertical ratio for the small rectangle, i.e.,

$$\frac{h}{w} = \frac{w - h}{h},$$


was called a “golden rectangle” by the ancient Greeks, who often used these in
their architecture. After solving for $h/w$, we obtain that the height-to-width

Table 27.2. Ratios for Shoshoni rectangles.

0.693  0.749  0.654  0.670  0.662  0.672  0.615  0.606  0.690  0.628
0.668  0.611  0.606  0.609  0.601  0.553  0.570  0.844  0.576  0.933

Source: C. Dubois (ed.). Lowie’s selected papers in anthropology, 1960.
(c) The Regents of the University of California.


ratio $h/w$ is equal to the “golden number” $(\sqrt{5} - 1)/2 \approx 0.618$. The data in
Table 27.2 represent corresponding $h/w$ ratios for rectangles used by Shoshoni
Indians to decorate their leather goods. Is it reasonable to assume that they
were also using golden rectangles? We examine this by means of a t-test.
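The step of solving the golden rectangle relation for $h/w$ can be written out as follows. The symbol $r = h/w$ is introduced here only to shorten the derivation; it does not appear in the original text.

```latex
\begin{align*}
\frac{h}{w} = \frac{w-h}{h}
  &\iff r = \frac{1}{r} - 1, \qquad \text{where } r = \frac{h}{w},\\
  &\iff r^2 + r - 1 = 0,\\
  &\iff r = \frac{-1 \pm \sqrt{5}}{2}.
\end{align*}
```

Only the positive root can be a ratio of lengths, which gives $r = (\sqrt{5} - 1)/2 \approx 0.618$.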

The observed ratios are modeled as a realization of a random sample from a
distribution with expectation $\mu$, where the parameter $\mu$ represents the true
esthetic preference for height-to-width ratios of the Shoshoni Indians. We want
to test

$$H_0: \mu = 0.618 \quad\text{against}\quad H_1: \mu \neq 0.618.$$

For the Shoshoni ratios, $\bar{x}_n = 0.6605$ and $s_n = 0.0925$, so that the value of
the test statistic is

$$t = \frac{\bar{x}_n - 0.618}{s_n/\sqrt{n}} = \frac{0.6605 - 0.618}{0.0925/\sqrt{20}} = 2.055.$$

Closer  examination  of  the  data  indicates  that  the  normal  distribution  is  not 
the  right  model.  For  instance,  by  definition  the  height-to-width  ratios  h/w 
are always between 0 and 1. Because some of the data points are also close
to the right boundary 1, the normal distribution is inappropriate. If we cannot
assume  a  normal  model  distribution,  we  can  no  longer  conclude  that  our  test 
statistic  has  a  t(n  —  1)  distribution  under  the  null  hypothesis. 

Since there is no reason to assume any other particular type of distribution
to model the data, we approximate the distribution of $T$ under the null
hypothesis. Recall that this distribution is the same as that of the studentized
mean (see the end of Section 27.1). To approximate its distribution, we use
the empirical bootstrap simulation for the studentized mean, as described
on page 351. We generate 10 000 bootstrap datasets and for each bootstrap
dataset $x_1^*, x_2^*, \ldots, x_n^*$, we compute

$$t^* = \frac{\bar{x}_n^* - 0.6605}{s_n^*/\sqrt{n}}.$$

In Figure 27.2 the kernel density estimate and empirical distribution function
are displayed for 10 000 bootstrap values $t^*$. Suppose we test $H_0: \mu = 0.618$
against $H_1: \mu \neq 0.618$ at level $\alpha = 0.05$. In the same way as in Section 23.3,
we find the following bootstrap approximations for the critical values:

$$c_l^* = -3.334 \quad\text{and}\quad c_u^* = 1.644.$$

Fig. 27.2. Kernel density estimate and empirical distribution function of 10 000
bootstrap values $t^*$.


Since for the Shoshoni data the value 2.055 of the test statistic is greater
than 1.644, we reject the null hypothesis at level 0.05. Alternatively, we can
also compute a bootstrap approximation of the one-tailed p-value corresponding
to 2.055, which is the right tail probability $P(T \geq 2.055)$. The bootstrap
approximation for this probability is:

$$\frac{\text{number of } t^* \text{ values greater than or equal to } 2.055}{10\,000} = 0.0067.$$

Hence $P(T \geq 2.055) \approx 0.0067$, which is smaller than $\alpha/2 = 0.025$. The value
2.055 should be considered as exceptionally large, and we reject the null
hypothesis. The esthetic preference for height-to-width ratios of the Shoshoni
Indians differs from that of the ancient Greeks.
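The empirical bootstrap for the studentized mean can be sketched in Python as follows. This is an illustration under stated assumptions, not the simulation the authors ran: the seed, the helper name `mean_and_sd`, and the simple index rule used for the 2.5% and 97.5% quantiles are choices made here, so the resulting numbers will be close to, but not identical with, the values reported in the text.

```python
import math
import random

# Ratios from Table 27.2.
ratios = [
    0.693, 0.749, 0.654, 0.670, 0.662, 0.672, 0.615, 0.606, 0.690, 0.628,
    0.668, 0.611, 0.606, 0.609, 0.601, 0.553, 0.570, 0.844, 0.576, 0.933,
]

def mean_and_sd(data):
    """Sample mean and sample standard deviation (divisor n - 1)."""
    n = len(data)
    m = sum(data) / n
    var = sum((x - m) ** 2 for x in data) / (n - 1)
    return m, math.sqrt(var)

n = len(ratios)
xbar, s = mean_and_sd(ratios)
t_obs = (xbar - 0.618) / (s / math.sqrt(n))   # observed test statistic

rng = random.Random(27)   # arbitrary seed, only for reproducibility
B = 10_000                # number of bootstrap datasets
t_star = []
while len(t_star) < B:
    resample = rng.choices(ratios, k=n)        # draw n values with replacement
    m_star, s_star = mean_and_sd(resample)
    if s_star == 0:                            # degenerate resample, skip it
        continue
    # Studentized bootstrap mean, centered at the original sample mean.
    t_star.append((m_star - xbar) / (s_star / math.sqrt(n)))

t_star.sort()
c_lower = t_star[int(0.025 * B)]   # approximate left critical value
c_upper = t_star[int(0.975 * B)]   # approximate right critical value
p_right = sum(v >= t_obs for v in t_star) / B   # bootstrap right tail probability

print(f"t = {t_obs:.3f}, critical values ({c_lower:.3f}, {c_upper:.3f}), "
      f"one-tailed p-value about {p_right:.4f}")
```

Note that the bootstrap values are centered at the sample mean 0.6605 and not at 0.618, because the bootstrap mimics the studentized mean, whose distribution coincides with that of $T$ under the null hypothesis.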


Large  samples 

For large sample sizes the distribution of the studentized mean can be
approximated by a standard normal distribution (see Section 23.4). This means
that for large sample sizes the distribution of the t-test statistic under the
null hypothesis can also be approximated by a standard normal distribution.
To illustrate this, recall the Old Faithful data. Park rangers in Yellowstone
National Park inform the public about the behavior of the geyser, such as the
expected time between successive eruptions and the duration of
an eruption. Suppose they claim that the expected length of an eruption is
4 minutes (240 seconds). Does this seem likely on the basis of the data from
Section 15.1? We investigate this by testing $H_0: \mu = 240$ against $H_1: \mu \neq 240$
at level $\alpha = 0.001$, where $\mu$ is the expectation of the model distribution. The
value of the test statistic is

$$t = \frac{\bar{x}_n - 240}{s_n/\sqrt{n}} = \frac{209.3 - 240}{68.48/\sqrt{272}} = -7.39.$$

The one-tailed p-value $P(T \leq -7.39)$ can be approximated by $P(Z \leq -7.39)$,
where $Z$ has an $N(0,1)$ distribution. From Table B.1 we see that this probability
is smaller than $P(Z \leq -3.49) = 0.0002$. This is smaller than $\alpha/2 = 0.0005$,
so we reject the null hypothesis at level 0.001. In fact the p-value is much
smaller: a statistical software package gives $P(Z \leq -7.39) = 7.5 \cdot 10^{-14}$. The
data provide overwhelming evidence against $H_0: \mu = 240$, so that we conclude
that the expected length of an eruption is different from 4 minutes.
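The normal approximation for the Old Faithful test can be retraced with the Python standard library. This sketch assumes a sample size of n = 272 (the size of the Old Faithful dataset, which is consistent with the reported value t = -7.39); the variable names are chosen here for illustration. The tail probability is computed with `math.erfc`, which stays accurate far out in the tail.

```python
import math
from statistics import NormalDist

# Summary values from the text; n = 272 is an assumption (see lead-in).
n, xbar, s, mu0 = 272, 209.3, 68.48, 240.0

t = (xbar - mu0) / (s / math.sqrt(n))

# Left tail probability P(Z <= t) for a standard normal Z,
# written with erfc to avoid cancellation for very small probabilities.
p_left = 0.5 * math.erfc(-t / math.sqrt(2))

# Two-sided test at level alpha: compare |t| with z_{alpha/2}.
alpha = 0.001
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(t) >= z_crit

print(f"t = {t:.2f}, z critical = {z_crit:.3f}, reject H0: {reject}")
print(f"approximate one-tailed p-value = {p_left:.1e}")
```

Because the statistic is computed from rounded summary values, the last digits of the p-value may differ slightly from the figure quoted from a software package.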

Quick exercise 27.3 Compute the critical region $K$ for the test, using the
normal approximation, and check that $t = -7.39$ falls in $K$.

In fact, if we test $H_0: \mu = 240$ against $H_1: \mu < 240$, the p-value
corresponding to $t = -7.39$ is the left tail probability $P(T \leq -7.39)$. This
probability is very small, so that we also reject the null hypothesis in favor
of this alternative and conclude that the expected length of an eruption is
smaller than 4 minutes.


27.3  The  t-test  in  a  regression  setting 

Is  calcium  in  your  drinking  water  good  for  your  health?  In  England  and  Wales, 
an  investigation  of  environmental  causes  of  disease  was  conducted.  The  annual 
mortality  rate  (percentage  of  deaths)  and  the  calcium  concentration  in  the 
drinking  water  supply  were  recorded  for  61  large  towns.  The  data  in  Table  27.3 
represent  the  annual  mortality  rate  averaged  over  the  years  1958-1964,  and 
the  calcium  concentration  in  parts  per  million.  In  Figure  27.3  the  61  paired 
measurements  are  displayed  in  a  scatterplot.  The  scatterplot  shows  a  slight 
downward  trend,  which  suggests  that  higher  concentrations  of  calcium  lead 
to  lower  mortality  rates.  The  question  is  whether  this  is  really  the  case  or  if 
the  slight  downward  trend  should  be  attributed  to  chance. 

To investigate this question we model the mortality data by means of a simple
linear regression model with normally distributed errors, with the mortality
rate as the dependent variable $y$ and the calcium concentration as the
independent variable $x$:

$$Y_i = \alpha + \beta x_i + U_i \quad\text{for } i = 1, 2, \ldots, 61,$$

where $U_1, U_2, \ldots, U_{61}$ is a random sample from an $N(0, \sigma^2)$ distribution. The
parameter $\beta$ represents the change of the mortality rate if we increase the
calcium concentration by one unit. We test the null hypothesis $H_0: \beta = 0$
(calcium has no effect on the mortality rate) against $H_1: \beta < 0$ (higher
concentration of calcium reduces the mortality rate).

This example illustrates the general situation, where the dataset


$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

is modeled by a simple linear regression model, and one wants to test a null
hypothesis of the form $H_0: \alpha = \alpha_0$ or $H_0: \beta = \beta_0$. Similar to the one-sample
t-test we will construct a test statistic for each of these null hypotheses. With
normally distributed errors, these test statistics have a t-distribution under
the null hypothesis. For this reason, for both null hypotheses the test is called
a t-test.

Table 27.3. Mortality data.

Rate  Calcium    Rate  Calcium    Rate  Calcium    Rate  Calcium
1247    105      1466      5      1299     78      1359     84
1392     73      1307     78      1254     96      1318    122
1260     21      1096    138      1402     37      1309     59
1259    133      1175    107      1486      5      1456     90
1236    101      1369     68      1257     50      1527     60
1627     53      1486    122      1485     81      1519     21
1581     14      1625     13      1668     17      1800     14
1609     18      1558     10      1807     15      1637     10
1755     12      1491     20      1555     39      1428     39
1723     44      1379     94      1742      8      1574      9
1569     91      1591     16      1772     15      1828      8
1704     26      1702     44      1427     27      1724      6
1696      6      1711     13      1444     14      1591     49
1987      8      1495     14      1587     75      1713     71
1557     13      1640     57      1709     71      1625     20
1378     71

Source: M. Hills and the M345 Course Team. M345 Statistical Methods,
Units 3: Examining Straight-line Data, 1986, Milton Keynes: (c) Open University,
28. Data provided by Professor M.J. Gardner, Medical Research Council
Environmental Epidemiology Research Unit, Southampton.

Fig. 27.3. Scatterplot mortality data.

The t-test for the slope

For the null hypothesis $H_0: \beta = \beta_0$, we use as test statistic

$$T_b = \frac{\hat{\beta} - \beta_0}{S_b},$$

where $\hat{\beta}$ is the least squares estimator for $\beta$ (see Chapter 22) and

$$S_b^2 = \frac{n\,\hat{\sigma}^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}.$$

In this expression,

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(Y_i - \hat{\alpha} - \hat{\beta}x_i\right)^2$$

is the estimator for $\sigma^2$ as introduced on page 332. It can be shown that

$$\mathrm{Var}(\hat{\beta} - \beta_0) = \frac{n\,\sigma^2}{n\sum x_i^2 - \left(\sum x_i\right)^2},$$

so that the random variable $S_b^2$ is an estimator for the variance of $\hat{\beta} - \beta_0$.
Hence, similar to the test statistic for the one-sample t-test, the test statistic $T_b$
compares the estimator $\hat{\beta}$ with the value $\beta_0$ and standardizes by dividing by
an estimator for the standard deviation of $\hat{\beta} - \beta_0$. Values of $T_b$ close to zero
are in favor of the null hypothesis $H_0: \beta = \beta_0$. Large positive values of $T_b$
suggest that $\beta > \beta_0$, whereas large negative values of $T_b$ suggest that $\beta < \beta_0$.
Recall that in the case of normal random samples the one-sample t-test
statistic has a $t(n-1)$ distribution under the null hypothesis. For the same reason,
it is also a fact that in the case of normally distributed errors the test
statistic $T_b$ has a $t(n-2)$ distribution under the null hypothesis $H_0: \beta = \beta_0$.

In our mortality example we want to test $H_0: \beta = 0$ against $H_1: \beta < 0$. For
the data we find $\hat{\beta} = -3.2261$ and $s_b = 0.4847$, so that the value of $T_b$ is

$$t_b = \frac{-3.2261}{0.4847} = -6.656.$$

If we test at level $\alpha = 0.05$, then we must compare this value with the left
critical value $-t_{59,0.05}$. This value is not in Table B.2, but we have that

$$-1.676 = -t_{50,0.05} < -t_{59,0.05}.$$

This means that $t_b$ is much smaller than $-t_{59,0.05}$, so that we reject the null
hypothesis at level 0.05. How much evidence the value $t_b = -6.656$ bears against
the null hypothesis is expressed by the one-tailed p-value $P(T_b \leq -6.656)$.
From Table B.2 we can only see that this probability is smaller than 0.0005.
By means of a statistical package we find $P(T_b \leq -6.656) = 5.2 \cdot 10^{-9}$. The
data provide overwhelming evidence against the null hypothesis. We conclude
that higher concentrations of calcium correspond to lower mortality rates.
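To make the slope test concrete, the following Python sketch computes the least squares slope, the estimate $s_b$, and the value of $T_b$ from raw data, using the standard formula $S_b^2 = n\hat{\sigma}^2 / (n\sum x_i^2 - (\sum x_i)^2)$. The function name `slope_t_statistic` and the four-point dataset are hypothetical and serve only as an illustration; the last lines redo the arithmetic for the mortality example from the summary values quoted in the text.

```python
import math

def slope_t_statistic(x, y, beta0=0.0):
    """Return (beta_hat, s_b, t_b) for simple linear regression of y on x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    denom = n * sxx - sx ** 2

    # Least squares estimates of slope and intercept.
    beta_hat = (n * sxy - sx * sy) / denom
    alpha_hat = (sy - beta_hat * sx) / n

    # Estimate of the error variance: residual sum of squares over n - 2.
    sigma2_hat = sum((b - alpha_hat - beta_hat * a) ** 2
                     for a, b in zip(x, y)) / (n - 2)

    s_b = math.sqrt(n * sigma2_hat / denom)
    return beta_hat, s_b, (beta_hat - beta0) / s_b

# Hypothetical toy data, not taken from the book.
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
beta_hat, s_b, t_b = slope_t_statistic(x, y)
print(f"toy data: beta_hat = {beta_hat:.3f}, s_b = {s_b:.4f}, t_b = {t_b:.3f}")

# Mortality example, from the summary values in the text.
t_mortality = -3.2261 / 0.4847
print(f"mortality data: t_b = {t_mortality:.3f}")
```

Under the null hypothesis and normal errors, the resulting value is compared with critical values of the $t(n-2)$ distribution.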
Quick exercise 27.4 The data in Table 27.3 can be separated into measurements
for towns at least as far north as Derby and towns south of Derby. For
the data corresponding to 35 towns at least as far north as Derby, one finds
$\hat{\beta} = -1.9313$ and $s_b = 0.8479$. Test $H_0: \beta = 0$ against $H_1: \beta < 0$ at level
0.01, i.e., compute the value of the test statistic and report your conclusion
about the null hypothesis.


The  t-test  for  the  intercept 

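We test the null hypothesis $H_0: \alpha = \alpha_0$ with test statistic

$$T_a = \frac{\hat{\alpha} - \alpha_0}{S_a},$$

where $\hat{\alpha}$ is the least squares estimator for $\alpha$ and

$$S_a^2 = \frac{\hat{\sigma}^2 \sum x_i^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \tag{27.1}$$

with $\hat{\sigma}^2$ defined as before. The random variable $S_a^2$ is an estimator for the
variance

$$\mathrm{Var}(\hat{\alpha} - \alpha_0) = \frac{\sigma^2 \sum x_i^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}.$$

Again, we compare the estimator $\hat{\alpha}$ with the value $\alpha_0$ and standardize by
dividing by an estimator for the standard deviation of $\hat{\alpha} - \alpha_0$. Values of $T_a$
close to zero are in favor of the null hypothesis $H_0: \alpha = \alpha_0$. Large positive
values of $T_a$ suggest that $\alpha > \alpha_0$, whereas large negative values of $T_a$ suggest
that $\alpha < \alpha_0$. Like $T_b$, in the case of normal errors, the test statistic $T_a$ has a
$t(n-2)$ distribution under the null hypothesis $H_0: \alpha = \alpha_0$.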
As an illustration, recall Exercise 17.9 where we modeled the volume $y$ of
black cherry trees by means of a linear model without intercept, with
independent variable $x = d^2h$, where $d$ and $h$ are the diameter and height of the
trees. The scatterplot of the pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_{31}, y_{31})$ is displayed
in Figure 27.4. As mentioned in Exercise 17.9, there are physical reasons to
leave out the intercept. We want to investigate whether this is confirmed by
the data. To this end, we model the data by a simple linear regression model
with intercept

$$Y_i = \alpha + \beta x_i + U_i \quad\text{for } i = 1, 2, \ldots, 31,$$

where $U_1, U_2, \ldots, U_{31}$ are a random sample from an $N(0, \sigma^2)$ distribution, and
we test $H_0: \alpha = 0$ against $H_1: \alpha \neq 0$ at level 0.10. The value of the test
statistic is

$$t_a = \frac{-0.2977}{0.9636} = -0.3089,$$

and the left critical value is $-t_{29,0.05} = -1.699$. This means we cannot reject
the null hypothesis. The data do not provide sufficient evidence against
$H_0: \alpha = 0$, which is confirmed by the one-tailed p-value $P(T_a \leq -0.3089) = 0.3798$
(computed by means of a statistical package). We conclude that the intercept
does not contribute significantly to the model.
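The intercept test follows the same pattern as the slope test, with the standard least squares formula $S_a^2 = \hat{\sigma}^2 \sum x_i^2 / (n\sum x_i^2 - (\sum x_i)^2)$ in the denominator. The sketch below is illustrative only: `intercept_t_statistic` and the four-point dataset are hypothetical, and the black cherry tree numbers are the summary values quoted in the text, with the critical value 1.699 copied from the text rather than computed.

```python
import math

def intercept_t_statistic(x, y, alpha0=0.0):
    """Return (alpha_hat, s_a, t_a) for simple linear regression of y on x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    denom = n * sxx - sx ** 2

    beta_hat = (n * sxy - sx * sy) / denom
    alpha_hat = (sy - beta_hat * sx) / n
    sigma2_hat = sum((b - alpha_hat - beta_hat * a) ** 2
                     for a, b in zip(x, y)) / (n - 2)

    # Estimated standard deviation of the intercept estimator.
    s_a = math.sqrt(sigma2_hat * sxx / denom)
    return alpha_hat, s_a, (alpha_hat - alpha0) / s_a

# Hypothetical toy data, not taken from the book.
alpha_hat, s_a, t_a = intercept_t_statistic([1, 2, 3, 4], [2, 3, 5, 6])
print(f"toy data: alpha_hat = {alpha_hat:.3f}, s_a = {s_a:.4f}, t_a = {t_a:.3f}")

# Black cherry tree example, from the summary values in the text.
t_cherry = -0.2977 / 0.9636
t_crit = 1.699                      # t_{29, 0.05}, as quoted in the text
reject = abs(t_cherry) >= t_crit    # two-sided test at level 0.10
print(f"cherry trees: t_a = {t_cherry:.4f}, reject H0: {reject}")
```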
Fig. 27.4. Scatterplot of the black cherry tree data.


27.4  Solutions  to  the  quick  exercises 

27.1 If $Y$ has an $N(1, 1)$ distribution, then $Y - 1$ has an $N(0, 1)$ distribution.
Therefore, from Table B.1: $P(Y - 1 \geq 0.03) = 0.4880$. If $Y$ has an
$N(1, (0.01)^2)$ distribution, then $(Y - 1)/0.01$ has an $N(0, 1)$ distribution. In
that case, $P(Y - 1 \geq 0.03) = P((Y - 1)/0.01 \geq 3) = 0.0013$.


27.2 For the first and last ten measurements the values of the test statistic
are

$$t = \frac{1.0194 - 1}{0.0290/\sqrt{10}} = 2.115 \quad\text{and}\quad t = \frac{1.0406 - 1}{0.0428/\sqrt{10}} = 3.000.$$

The critical value is $t_{9,0.025} = 2.262$, which means we reject the null hypothesis
for the second production line, but not for the first production line.

27.3 The critical region is of the form $K = (-\infty, c_l] \cup [c_u, \infty)$. The right
critical value $c_u$ is approximated by $z_{0.0005} = t_{\infty,0.0005} = 3.291$, which can be
found in Table B.2. By symmetry of the normal distribution, the left critical
value $c_l$ is approximated by $-z_{0.0005} = -3.291$. Clearly, $t = -7.39 < -3.291$,
so that it falls in $K$.

27.4 The value of the test statistic is

$$t_b = \frac{-1.9313}{0.8479} = -2.278.$$

The left critical value is equal to $-t_{33,0.01}$, which is not in Table B.2, but we
see that $-t_{33,0.01} < -t_{40,0.01} = -2.423$. This means that $-t_{33,0.01} < t_b$, so
that we cannot reject $H_0: \beta = 0$ against $H_1: \beta < 0$ at level 0.01.
27.5 Exercises

27.1 We perform a t-test for the null hypothesis $H_0: \mu = 10$ by means of
a dataset consisting of $n = 16$ elements with sample mean 11 and sample
variance 4. We use significance level 0.05.

a. Should we reject the null hypothesis in favor of $H_1: \mu \neq 10$?

b. What if we test against $H_1: \mu > 10$?

27.2 □ The Cleveland Casting Plant is a large, highly automated producer of
gray and nodular iron automotive castings for Ford Motor Company. One
process variable of interest to Cleveland Casting is the pouring temperature
of molten iron. The pouring temperatures (in degrees Fahrenheit) of ten
crankshafts are given in Table 27.4. The target setting for the pouring
temperature is set at 2550 degrees. One wants to conduct a test at level $\alpha = 0.01$
to determine whether the pouring temperature differs from the target setting.


Table 27.4. Pouring temperatures of ten crankshafts.

2543  2541  2544  2620  2560  2559  2562  2553  2552  2553

(c) 1995 From A structural model relating process inputs and final product
characteristics, Quality Engineering, Vol 7, No. 4, pp. 693-704, by
Price, B. and Barth, B. Reproduced by permission of Taylor & Francis, Inc.,
http://www.taylorandfrancis.com


a.  Formulate  the  appropriate  null  hypothesis  and  alternative  hypothesis. 

b.  Compute  the  value  of  the  test  statistic  and  report  your  conclusion.  You 
may  assume  a  normal  model  distribution  and  use  that  the  sample  variance 
is  517.34. 

27.3 Table 27.5 lists the results of tensile adhesion tests on 22 U-700 alloy
specimens. The data are loads at failure in MPa. The sample mean is 13.71
and the sample standard deviation is 3.55. You may assume that the data
originated from a normal distribution with expectation $\mu$. One is interested
in whether the load at failure exceeds 10 MPa. We investigate this by means
of a t-test for the null hypothesis $H_0: \mu = 10$.

a.  What  do  you  choose  as  the  alternative  hypothesis? 

b.  Compute  the  value  of  the  test  statistic  and  report  your  conclusion,  when 
performing  the  test  at  level  0.05. 
Table 27.5. Loads at failure of U-700 specimens.

19.8  18.5  17.6  16.7  15.8  15.4  14.1  13.6  11.9  11.4  11.4
 8.8   7.5  15.4  15.4  19.5  14.9  12.7  11.9  11.4  10.1   7.9

Source: C.C. Berndt. Instrumented Tensile adhesion tests on plasma sprayed
thermal barrier coatings. Journal of Materials Engineering 11(4): 275-282,
Dec 1989. (c) Springer-Verlag New York Inc.


27.4 Consider the coal data from Table 23.2, where 22 gross calorific value
measurements are listed for Daw Mill coal coded 258GB41. We modeled this
dataset as a realization of a random sample from an $N(\mu, \sigma^2)$ distribution
with $\mu$ and $\sigma$ unknown. We are planning to buy a shipment if the gross
calorific value exceeds 31.00 MJ/kg. The sample mean and sample standard
deviation of the data are $\bar{x}_n = 31.012$ and $s_n = 0.1294$. Perform a t-test for the null
hypothesis $H_0: \mu = 31.00$ against $H_1: \mu > 31.00$ using significance level 0.01,
i.e., compute the value of the test statistic, the critical value of the test, and
report your conclusion.

27.5 ⊞ In the November 1988 issue of Science a study was reported on the
inbreeding of tropical swarm-founding wasps. Each member of a sample of
197 wasps was captured, frozen, and subjected to a series of genetic tests,
from which an inbreeding coefficient was determined. The sample mean and
the sample standard deviation of the coefficients are $\bar{x}_{197} = 0.044$ and
$s_{197} = 0.884$. If a species does not have the tendency to inbreed, their true inbreeding
coefficient is 0. Determine by means of a test whether the inbreeding coefficient
for this species of wasp exceeds 0.

a.  Formulate  the  appropriate  null  hypothesis  and  alternative  hypothesis  and 
compute  the  value  of  the  test  statistic. 

b.  Compute  the  p-value  corresponding  to  the  value  of  the  test  statistic  and 
report  your  conclusion  about  the  null  hypothesis. 

27.6 The stopping distance of an automobile is related to its speed. The data
in Table 27.6 give the stopping distance in feet and speed in miles per hour
of an automobile. The data are modeled by means of a simple linear regression
model with normally distributed errors, with the square root of the stopping
distance as dependent variable $y$ and the speed as independent variable $x$:

$$Y_i = \alpha + \beta x_i + U_i, \quad\text{for } i = 1, \ldots, 7.$$

For the dataset we find

$$\hat{\alpha} = 5.388, \quad \hat{\beta} = 4.252, \quad s_a = 1.874, \quad s_b = 0.242.$$
Table 27.6. Speed and stopping distance of automobiles.

Speed      20.5   20.5   30.5   30.5   40.5   48.8   57.8
Distance   15.4   13.3   33.9   27.0   73.1  113.0  142.6

Source: K.A. Brownlee. Statistical theory and methodology in science and
engineering. Wiley, New York, 1960; Table II.9 on page 372.


One  would  expect  that  the  intercept  can  be  taken  equal  to  0,  since  zero  speed 
would  yield  zero  stopping  distance.  Investigate  whether  this  is  confirmed  by 
the  data  by  performing  the  appropriate  test  at  level  0.10.  Formulate  the  proper 
null  and  alternative  hypothesis,  compute  the  value  of  the  test  statistic,  and 
report  your  conclusion. 

27.7 ⊞ In a study about the effect of wall insulation, the weekly gas
consumption (in 1000 cubic feet) and the average outside temperature (in
degrees Celsius) were measured for a certain house in southeast England, for 26
weeks before and 30 weeks after cavity-wall insulation had been installed.
The house thermostat was set at 20 degrees throughout. The data are listed
in Table 27.7. We model the data before insulation by means of a simple
linear regression model with normally distributed errors and gas consumption
as response variable. A similar model was used for the data after insulation.
Given are

Before insulation: $\hat{\alpha} = 6.8538$, $\hat{\beta} = -0.3932$ and $s_a = 0.1184$, $s_b = 0.0196$.
After insulation: $\hat{\alpha} = 4.7238$, $\hat{\beta} = -0.2779$ and $s_a = 0.1297$, $s_b = 0.0252$.

a. Use the data before insulation to investigate whether lower outside
temperatures lead to higher gas consumption. Formulate the proper null and
alternative hypothesis, compute the value of the test statistic, and report
your conclusion, using significance level 0.05.

b.  Do  the  same  for  the  data  after  insulation. 
Table 27.7. Temperature and gas consumption.

       Before insulation                  After insulation
Temperature   Gas consumption     Temperature   Gas consumption
   -0.8            7.2               -0.7            4.8
   -0.7            6.9                0.8            4.6
    0.4            6.4                1.0            4.7
    2.5            6.0                1.4            4.0
    2.9            5.8                1.5            4.2
    3.2            5.8                1.6            4.2
    3.6            5.6                2.3            4.1
    3.9            4.7                2.5            4.0
    4.2            5.8                2.5            3.5
    4.3            5.2                3.1            3.2
    5.4            4.9                3.9            3.9
    6.0            4.9                4.0            3.5
    6.0            4.3                4.0            3.7
    6.0            4.4                4.2            3.5
    6.2            4.5                4.3            3.5
    6.3            4.6                4.6            3.7
    6.9            3.7                4.7            3.5
    7.0            3.9                4.9            3.4
    7.4            4.2                4.9            3.7
    7.5            4.0                4.9            4.0
    7.5            3.9                5.0            3.6
    7.6            3.5                5.3            3.7
    8.0            4.0                6.2            2.8
    8.5            3.6                7.1            3.0
    9.1            3.1                7.2            2.8
   10.2            2.6                7.5            2.6
                                      8.0            2.7
                                      8.7            2.8
                                      8.8            1.3
                                      9.7            1.5

Source: MDST242 Statistics in Society, Unit 45: Review, 2nd edition, 1984,
Milton Keynes: (c) The Open University, Figures 2.5 and 2.6.
Comparing two samples


Many applications are concerned with two groups of observations of the same
kind that originate from two possibly different model distributions, and the
question is whether these distributions have different expectations. We
describe a test for equality of expectations, where we consider normal and
nonnormal model distributions and equal and unequal variances of the model
distributions.


28.1  Is  dry  drilling  faster  than  wet  drilling? 

Recall  the  drilling  example  from  Sections  15.5  and  16.4.  The  question  was 
whether  dry  drilling  is  faster  than  wet  drilling.  The  scatterplots  in  Figure  15.11 
seem  to  suggest  that  up  to  a  depth  of  250  feet  the  drill  time  does  not  depend 
on  depth.  Therefore,  for  a  first  investigation  of  a  possible  difference  between 
dry  and  wet  drilling  we  only  consider  the  (mean)  drill  times  up  to  this  depth. 
A  more  thorough  study  can  be  found  in  [23] . 

The boxplots of the drill times for both types of drilling are displayed in
Figure 28.1. Clearly, the boxplot for dry drilling is positioned lower than the
one for wet drilling. However, the question is whether this difference can be
attributed to chance or if it is large enough to conclude that the dry drill
time is shorter than the wet drill time. To answer this question, we model the
datasets of dry and wet drill times as realizations of random samples from
two distribution functions $F$ and $G$, one with expected value $\mu_1$ and the other
with expected value $\mu_2$. The parameters $\mu_1$ and $\mu_2$ represent the drill times
of dry drilling and wet drilling, respectively. We test $H_0: \mu_1 = \mu_2$ against
$H_1: \mu_1 < \mu_2$.

Fig. 28.1. Boxplot of drill times.

This example illustrates a general situation where we compare two datasets

$$x_1, x_2, \ldots, x_n \quad\text{and}\quad y_1, y_2, \ldots, y_m,$$

which are the realization of independent random samples

$$X_1, X_2, \ldots, X_n \quad\text{and}\quad Y_1, Y_2, \ldots, Y_m,$$

from two distributions, and we want to test whether the expectations of both
distributions are the same. Both the variance $\sigma_X^2$ of the $X_i$ and the variance
$\sigma_Y^2$ of the $Y_j$ are unknown.

Note that the null hypothesis is equivalent to the statement $\mu_1 - \mu_2 = 0$. For
this reason, similar to Chapter 27, the test statistic for the null hypothesis
$H_0: \mu_1 = \mu_2$ is based on an estimator $\bar{X}_n - \bar{Y}_m$ for the difference $\mu_1 - \mu_2$. As
before, we standardize $\bar{X}_n - \bar{Y}_m$ by an estimator for its variance

$$\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}.$$

Recall that the sample variances $S_X^2$ and $S_Y^2$ of the $X_i$ and $Y_j$ are unbiased
estimators for $\sigma_X^2$ and $\sigma_Y^2$. We will use a combination of $S_X^2$ and $S_Y^2$ to
construct an estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$. The actual standardization of $\bar{X}_n - \bar{Y}_m$
depends on whether the variances of the $X_i$ and $Y_j$ are the same. We
distinguish between the two cases $\sigma_X^2 = \sigma_Y^2$ and $\sigma_X^2 \neq \sigma_Y^2$. In the next section we
consider the case of equal variances.

Quick exercise 28.1 Looking at the boxplots in Figure 28.1, does the
assumption $\sigma_X^2 = \sigma_Y^2$ seem reasonable to you? Can you think of a way to
quantify your belief?


28.2  Two  samples  with  equal  variances 

Suppose that the samples originate from distributions with the same (but
unknown) variance:

$$\sigma_X^2 = \sigma_Y^2 = \sigma^2.$$

In this case we can pool the sample variances $S_X^2$ and $S_Y^2$ by constructing
a linear combination $aS_X^2 + bS_Y^2$ that is an unbiased estimator for $\sigma^2$. One
particular choice is the weighted average




$$\frac{(n-1)S_X^2 + (m-1)S_Y^2}{n + m - 2}.$$

It has the property that for normally distributed samples it has the smallest
variance among all unbiased linear combinations of $S_X^2$ and $S_Y^2$ (see
Exercise 28.5). Moreover, the weights depend on the sample sizes. This is
appropriate, since if one sample is much larger than the other, the estimate of $\sigma^2$
from that sample is more reliable and should receive greater weight.

We  find  that  the  pooled-variance: 

2  {n  —  l)Sx  +  {'m  —  1)Sy  fl  1 


S^  = 


is  an  unbiased  estimator  for 


n  -\-  m  —  2 


Var(X„  -  Ym)  =  cr^  -h  — )  . 

\n  m  J 


This leads to the following test statistic for the null hypothesis $H_0: \mu_1 = \mu_2$:

$$T_p = \frac{\bar{X}_n - \bar{Y}_m}{S_p}.$$

As before, we compare the estimator $\bar{X}_n - \bar{Y}_m$ with 0 (the value of $\mu_1 - \mu_2$
under the null hypothesis), and we standardize by dividing by the estimator $S_p$
for the standard deviation of $\bar{X}_n - \bar{Y}_m$. Values of $T_p$ close to zero are in favor
of the null hypothesis $H_0: \mu_1 = \mu_2$. Large positive values of $T_p$ suggest that
$\mu_1 > \mu_2$, whereas large negative values suggest that $\mu_1 < \mu_2$.
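The computation of $T_p$ from two datasets can be sketched in a few lines of Python. This is only an illustration under the equal-variance assumption of this section; the function name `pooled_t_statistic` and the toy numbers in the usage note are our own and do not come from the drilling data.

```python
import math
import statistics


def pooled_t_statistic(x, y):
    """Return (t_p, s_p) for datasets x and y, assuming equal variances."""
    n, m = len(x), len(y)
    s2_x = statistics.variance(x)  # sample variance, divisor n - 1
    s2_y = statistics.variance(y)  # sample variance, divisor m - 1
    # Weighted average of the two sample variances (estimates sigma^2).
    pooled = ((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2)
    # Estimated standard deviation of the difference of the sample means.
    s_p = math.sqrt(pooled * (1 / n + 1 / m))
    t_p = (statistics.mean(x) - statistics.mean(y)) / s_p
    return t_p, s_p


t_p, s_p = pooled_t_statistic([1, 2, 3, 4], [2, 4, 6, 8, 10])
```

For the toy datasets above the weighted average of the sample variances is 45/7, so that $s_p^2 = (45/7)(1/4 + 1/5)$; the negative sign of the resulting $t_p$ reflects that the first sample mean is the smaller one.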

The next step is to determine the distribution of $T_p$. Note that under the null
hypothesis $H_0: \mu_1 = \mu_2$, the test statistic $T_p$ is the pooled studentized mean
difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_p}.$$

Hence, under the null hypothesis, the probability distribution of $T_p$ is the
same as that of the pooled studentized mean difference. To determine its
distribution, we distinguish between normal and nonnormal data.


Normal  samples 

In the same way as the studentized mean of a single normal sample has a
$t(n-1)$ distribution (see page 349), it is also a fact that if two independent
samples originate from normal distributions, i.e.,

$X_1, X_2, \ldots, X_n$ random sample from $N(\mu_1, \sigma^2)$
$Y_1, Y_2, \ldots, Y_m$ random sample from $N(\mu_2, \sigma^2)$,

then the pooled studentized mean difference has a $t(n+m-2)$ distribution.
Hence, under the null hypothesis, the test statistic $T_p$ has a $t(n+m-2)$
distribution. For this reason, a test for the null hypothesis $H_0: \mu_1 = \mu_2$ is
called a two-sample t-test.

Suppose that in our drilling example we model our datasets as realizations
of random samples of sizes $n = m = 50$ from two normal distributions with
equal variances, and we test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 < \mu_2$ at level 0.05.
For the data we find $\bar{x}_{50} = 727.78$, $\bar{y}_{50} = 873.02$, and $s_p = 13.62$, so that

$$t_p = \frac{727.78 - 873.02}{13.62} = -10.66.$$

We compare this with the left critical value $-t_{98,0.05}$. This value is not in
Table B.2, but $-1.676 = -t_{50,0.05} < -t_{98,0.05}$. This means that $t_p < -t_{98,0.05}$,
so that we reject $H_0: \mu_1 = \mu_2$ in favor of $H_1: \mu_1 < \mu_2$ at level 0.05. The
p-value corresponding to $t_p = -10.66$ is the left tail probability $P(T \le -10.66)$.
From Table B.2 we can only see that this is smaller than 0.0005 (a statistical
software package gives $P(T \le -10.66) = 2.25 \cdot 10^{-18}$). The data provide
overwhelming evidence against the null hypothesis, so that we conclude that dry
drilling is faster than wet drilling.
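The arithmetic of this example is easy to check. The short sketch below uses only the summary values quoted above; the variable names are ours, and the critical value $-1.676 = -t_{50,0.05}$ is the conservative stand-in from Table B.2 for $-t_{98,0.05}$.

```python
x_bar, y_bar = 727.78, 873.02  # sample means of the two drill-time datasets
s_p = 13.62                    # pooled estimate of the standard deviation
t_p = (x_bar - y_bar) / s_p

# -t_{50,0.05} from Table B.2; it lies to the left of -t_{98,0.05}, so
# rejecting with this value implies rejecting with the exact critical value.
critical_value = -1.676
reject_null = t_p < critical_value

print(round(t_p, 2), reject_null)
```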


Quick exercise 28.2 Suppose that in the ball bearing example of Quick
exercise 27.2, we test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$, where $\mu_1$ and $\mu_2$
represent the diameters of a ball bearing from the first and second production
line. What are the critical values corresponding to level $\alpha = 0.01$?


Nonnormal  samples 

Similar to the one-sample t-test, if we cannot assume normal model
distributions, then we can no longer conclude that our test statistic has a
$t(n+m-2)$ distribution under the null hypothesis. Recall that under the null
hypothesis, the distribution of our test statistic is the same as that of the
pooled studentized mean difference (see page 417).

To approximate its distribution, we use the empirical bootstrap simulation
for the pooled studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_p}.$$

Given datasets $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$, determine their empirical
distribution functions $F_n$ and $G_m$ as estimates for $F$ and $G$. The expectations
corresponding to $F_n$ and $G_m$ are $\mu_1^* = \bar{x}_n$ and $\mu_2^* = \bar{y}_m$. Then repeat the
following two steps many times:

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$ and a bootstrap
dataset $y_1^*, y_2^*, \ldots, y_m^*$ from $G_m$.

2. Compute the pooled studentized mean difference for the bootstrap data:

$$t_p^* = \frac{(\bar{x}_n^* - \bar{y}_m^*) - (\bar{x}_n - \bar{y}_m)}{s_p^*},$$

where $\bar{x}_n^*$ and $\bar{y}_m^*$ are the sample means of the bootstrap datasets, and

$$(s_p^*)^2 = \frac{(n-1)(s_X^*)^2 + (m-1)(s_Y^*)^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)$$

with $(s_X^*)^2$ and $(s_Y^*)^2$ the sample variances of the bootstrap datasets.

The reason that in each iteration we subtract $\bar{x}_n - \bar{y}_m$ is that $\mu_1 - \mu_2$ is
the difference of the expectations of the two model distributions. Therefore,
according to the bootstrap principle we should replace this by the difference
$\bar{x}_n - \bar{y}_m$ of the expectations corresponding to the two empirical distribution
functions.
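The two-step bootstrap procedure can be sketched as follows. This is a minimal illustration, not the program used for the simulation in this chapter: the function names, the default of 1000 repetitions, the fixed seed, and the particular rule for reading off an empirical quantile are all our own choices, and the small datasets in the usage lines are made up.

```python
import math
import random
import statistics


def pooled_se(x, y):
    """Pooled estimate of the standard deviation of mean(x) - mean(y)."""
    n, m = len(x), len(y)
    pooled = ((n - 1) * statistics.variance(x)
              + (m - 1) * statistics.variance(y)) / (n + m - 2)
    return math.sqrt(pooled * (1 / n + 1 / m))


def bootstrap_pooled_t(x, y, repetitions=1000, seed=0):
    """Bootstrap values t_p* of the pooled studentized mean difference."""
    rng = random.Random(seed)
    observed_diff = statistics.mean(x) - statistics.mean(y)
    values = []
    for _ in range(repetitions):
        # Step 1: sample with replacement, i.e., draw from F_n and G_m.
        xs = rng.choices(x, k=len(x))
        ys = rng.choices(y, k=len(y))
        se = pooled_se(xs, ys)
        if se == 0:
            continue  # degenerate resample with no spread; skip it
        # Step 2: center at the observed difference and studentize.
        diff = statistics.mean(xs) - statistics.mean(ys)
        values.append((diff - observed_diff) / se)
    return values


def left_critical_value(values, alpha=0.05):
    """Empirical alpha-quantile (one of several possible conventions)."""
    ordered = sorted(values)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


x = [1.0, 2.5, 3.1, 4.7, 5.2, 6.8]
y = [2.0, 3.9, 4.4, 6.1, 7.3]
t_star = bootstrap_pooled_t(x, y, repetitions=200, seed=1)
c_left = left_critical_value(t_star)
```

The null hypothesis would be rejected against $H_1: \mu_1 < \mu_2$ when the value $t_p$ computed from the original data falls below `c_left`. Different seeds and different numbers of repetitions give slightly different critical values.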

We carried out this bootstrap simulation for the drill times. The result of this
simulation can be seen in Figure 28.2, where a histogram and the empirical
distribution function are displayed for one thousand bootstrap values of $t_p^*$.
Suppose that we test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 < \mu_2$ at level 0.05. The
bootstrap approximation for the left critical value is $c_l^* = -1.659$. The value
of $t_p = -10.66$, computed from the data, is much smaller. Hence, also on the
basis of the bootstrap simulation we reject the null hypothesis and conclude
that the dry drill time is shorter than the wet drill time.


Fig. 28.2. Histogram and empirical distribution function of 1000 bootstrap values
for $t_p^*$.


28.3  Two  samples  with  unequal  variances 

During an investigation about weather modification, a series of experiments
was conducted in southern Florida from 1968 to 1972. These experiments
were designed to investigate the use of massive silver-iodide seeding. It was
hypothesized that under specified conditions, this leads to invigorated cumulus
growth and prolonged lifetimes, thereby causing increased precipitation. In
these experiments, 52 isolated cumulus clouds were observed, of which 26 were
selected at random and injected with silver-iodide smoke. Rainfall amounts
(in acre-feet) were recorded for all clouds. They are listed in Table 28.1. To
investigate whether seeding leads to increased rainfall, we test $H_0: \mu_1 = \mu_2$
against $H_1: \mu_1 < \mu_2$, where $\mu_1$ and $\mu_2$ represent the rainfall for unseeded and
seeded clouds.

Table 28.1. Rainfall data.

Unseeded
1202.6   830.1   372.4   345.5   321.2   244.3   163.0   147.8    95.0    87.0
  81.2    68.5    47.3    41.1    36.6    29.0    28.6    26.3    26.1    24.4
  21.7    17.3    11.5     4.9     4.9     1.0

Seeded
2745.6  1697.8  1656.0   978.0   703.4   489.1   430.0   334.1   302.8   274.7
 274.7   255.0   242.5   200.7   198.6   129.6   119.0   118.3   115.3    92.4
  40.6    32.7    31.4    17.5     7.7     4.1

Source: J. Simpson, A. Olsen, and J.C. Eden. A Bayesian analysis of a
multiplicative treatment effect in weather modification. Technometrics,
17:161-166, 1975; Table 1 on page 162.

In Figure 28.3 the boxplots of both datasets are displayed. From this we
see that the assumption of equal variances may not be realistic. Indeed, this
is confirmed by the values $s_X^2 = 77\,521$ and $s_Y^2 = 423\,524$ of the sample
variances of the datasets. This means that we need to test $H_0: \mu_1 = \mu_2$
without the assumption of equal variances. As before, the test statistic will be
a standardized version of $\bar{X}_n - \bar{Y}_m$, but $S_p^2$ is no longer an unbiased estimator
for

$$\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}.$$

However, if we estimate $\sigma_X^2$ and $\sigma_Y^2$ by $S_X^2$ and $S_Y^2$, then the nonpooled variance

$$S_d^2 = \frac{S_X^2}{n} + \frac{S_Y^2}{m}$$

is an unbiased estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$. This leads to the test statistic

$$T_d = \frac{\bar{X}_n - \bar{Y}_m}{S_d}.$$

Fig. 28.3. Boxplots of rainfall.


Again, we compare the estimator $\bar{X}_n - \bar{Y}_m$ with zero and standardize by
dividing by an estimator for the standard deviation of $\bar{X}_n - \bar{Y}_m$. Values of
$T_d$ close to zero are in favor of the null hypothesis $H_0: \mu_1 = \mu_2$.

Quick exercise 28.3 Consider the ball bearing example from Quick
exercise 27.2. Compute the value of $T_d$ for this example.


Under the null hypothesis $H_0: \mu_1 = \mu_2$, the test statistic

$$T_d = \frac{\bar{X}_n - \bar{Y}_m}{S_d}$$

is equal to the nonpooled studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_d}.$$

Therefore, the distribution of $T_d$ under the null hypothesis is the same as that
of the nonpooled studentized mean difference. Unfortunately, its distribution
is not a t-distribution, not even in the case of normal samples. This means
that we have to approximate this distribution.

Similar to the previous section, we use the empirical bootstrap simulation for
the nonpooled studentized mean difference. The only difference with the
procedure outlined in the previous section is that now in each iteration we
compute the nonpooled studentized mean difference for the bootstrap datasets:

$$t_d^* = \frac{(\bar{x}_n^* - \bar{y}_m^*) - (\bar{x}_n - \bar{y}_m)}{s_d^*},$$

where $\bar{x}_n^*$ and $\bar{y}_m^*$ are the sample means of the bootstrap datasets, and

$$(s_d^*)^2 = \frac{(s_X^*)^2}{n} + \frac{(s_Y^*)^2}{m}$$

with $(s_X^*)^2$ and $(s_Y^*)^2$ the sample variances of the bootstrap datasets.

Fig. 28.4. Histogram and empirical distribution function of 1000 bootstrap values
for $t_d^*$.

We carried out this bootstrap simulation for the cloud seeding data. The
result of this simulation can be seen in Figure 28.4, where a histogram and
the empirical distribution function are displayed for one thousand values $t_d^*$.
The bootstrap approximation for the left critical value corresponding to level
0.05 is $c_l^* = -1.405$. For the data we find the value

$$t_d = \frac{164.59 - 441.98}{138.82} = -1.998.$$

This is smaller than $c_l^*$, so we reject the null hypothesis. Although the evidence
against the null hypothesis is not overwhelming, there is some indication that
seeding clouds leads to more rainfall.
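The value of $t_d$ for the cloud seeding data can be recomputed from the summary statistics quoted in this section. The sketch below assumes nothing beyond those numbers; the variable names are ours. Computing $s_d$ from the two sample variances gives approximately 138.82.

```python
import math

n = m = 26
x_bar, y_bar = 164.59, 441.98  # sample means: unseeded, seeded
s2_x, s2_y = 77521, 423524     # sample variances: unseeded, seeded

# Nonpooled estimate of the standard deviation of the mean difference.
s_d = math.sqrt(s2_x / n + s2_y / m)
t_d = (x_bar - y_bar) / s_d

c_left = -1.405                # bootstrap left critical value at level 0.05
reject_null = t_d < c_left

print(round(s_d, 2), round(t_d, 3), reject_null)
```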


28.4  Large  samples 

Variants of the central limit theorem state that as $n$ and $m$ both tend to
infinity, the distributions of the pooled studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_p}$$

and the nonpooled studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_d}$$

both approach the standard normal distribution. This fact can be used to
approximate the distribution of the test statistics $T_p$ and $T_d$ under the null
hypothesis by a standard normal distribution.

We illustrate this by means of the following example. To investigate whether a
restricted diet promotes longevity, two groups of randomly selected rats were
put on the different diets. One group of $n = 106$ rats was put on a restricted
diet, the other group of $m = 89$ rats on an ad libitum diet (i.e., unrestricted
eating). The data in Table 28.2 represent the remaining lifetime in days of two
groups of rats after they were put on the different diets. The average lifetimes
are $\bar{x}_n = 968.75$ and $\bar{y}_m = 684.01$ days. To investigate whether a restricted
diet promotes longevity, we test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 > \mu_2$, where
$\mu_1$ and $\mu_2$ represent the lifetime of a rat on a restricted diet and on an ad
libitum diet, respectively.

If we may assume equal variances, we compute

$$t_p = \frac{968.75 - 684.01}{32.88} = 8.66.$$

This value is larger than the right critical value $z_{0.0005} = 3.291$, which means
that we would reject $H_0: \mu_1 = \mu_2$ in favor of $H_1: \mu_1 > \mu_2$ at level $\alpha = 0.0005$.


Table 28.2. Rat data.

Restricted
 105   193   211   236   302   363   389   390   391   403
 530   604   605   630   716   718   727   731   749   769
 770   789   804   810   811   833   868   871   875   893
 897   901   906   907   919   923   931   940   957   958
 961   962   974   979   982  1001  1008  1010  1011  1012
1014  1017  1032  1039  1045  1046  1047  1057  1063  1070
1073  1076  1085  1090  1094  1099  1107  1119  1120  1128
1129  1131  1133  1136  1138  1144  1149  1160  1166  1170
1173  1181  1183  1188  1190  1203  1206  1209  1218  1220
1221  1228  1230  1231  1233  1239  1244  1258  1268  1294
1316  1327  1328  1369  1393  1435

Ad libitum
  89   104   387   465   479   494   496   514   532   536
 545   547   548   582   606   609   619   620   621   630
 635   639   648   652   653   654   660   665   667   668
 670   675   677   678   678   681   684   688   694   695
 697   698   702   704   710   711   712   715   716   717
 720   721   730   731   732   733   735   736   738   739
 741   743   746   749   751   753   764   765   768   770
 773   777   779   780   788   791   794   796   799   801
 806   807   815   836   838   850   859   894   963

Source: B.L. Berger, D.D. Boos, and F.M. Guess. Tests and confidence sets
for comparing two mean residual life functions. Biometrics, 44:103-115, 1988.




The p-value is the right tail probability $P(T_p \ge 8.66)$, which we approximate
by $P(Z \ge 8.66)$, where $Z$ has an $N(0,1)$ distribution. From Table B.1 we see
that this probability is smaller than $P(Z \ge 3.49) = 0.0002$. By means of a
statistical package we find $P(Z \ge 8.66) = 2.4 \cdot 10^{-18}$.

If we repeat the test without the assumption of equal variances, we compute

$$t_d = \frac{968.75 - 684.01}{s_d} = 9.16,$$

which also leads to rejection of the null hypothesis. In this case, the p-value
$P(T_d \ge 9.16) \approx P(Z \ge 9.16)$ is even smaller since $9.16 > 8.66$ (a statistical
package gives $P(Z \ge 9.16) = 2.6 \cdot 10^{-20}$). The data provide overwhelming
evidence against the null hypothesis, and we conclude that a restricted diet
promotes longevity.
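The normal tail probabilities quoted for the rat data can be reproduced with the complementary error function from the Python standard library. This is a sketch of the large-sample approximation only; the helper name `normal_right_tail` is ours.

```python
import math


def normal_right_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Pooled test statistic from the summary values of the rat data.
t_p = (968.75 - 684.01) / 32.88

p_pooled = normal_right_tail(8.66)     # approximates P(T_p >= 8.66)
p_nonpooled = normal_right_tail(9.16)  # approximates P(T_d >= 9.16)
p_table = normal_right_tail(3.49)      # entry of Table B.1 used as a bound

print(round(t_p, 2), p_pooled, p_nonpooled, round(p_table, 4))
```

Using `math.erfc` rather than `1 - Phi(z)` matters here: for tail probabilities this small, subtracting from 1 in floating point would return 0.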


28.5  Solutions  to  the  quick  exercises 

28.1 Just by looking at the boxplots, the authors believe that the assumption
$\sigma_X^2 = \sigma_Y^2$ is reasonable. The lengths of the boxplots and their IQRs are almost
the same. However, the boxplots do not reveal how the elements of the dataset
vary around the center. One way of quantifying our belief would be to compare
the sample variances of the datasets. One possibility is to compare the ratio of
both sample variances; a ratio close to one would support our belief of equal
variances (in case of normal samples, this is a standard test called the F-test).

28.2 In this case we have a right and left critical value. From Quick
exercise 27.2 we know that $n = m = 10$, so that the right critical value is
$t_{18,0.005} = 2.878$ and the left critical value is $-t_{18,0.005} = -2.878$.

28.3 We first compute $s_d^2 = (0.0290)^2/10 + (0.0428)^2/10 = 0.000267$ and then
$t_d = (1.0194 - 1.0406)/\sqrt{0.000267} = -1.297$.
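The ratio of sample variances mentioned in the solution to Quick exercise 28.1 is simple to compute. As an illustration we apply it to the sample variances of the rainfall data from Section 28.3, since the drill times themselves are not reproduced here; the function name is ours, and turning the ratio into a formal F-test would additionally require critical values of the F distribution, which this sketch does not attempt.

```python
def variance_ratio(s2_a, s2_b):
    """Ratio of the larger to the smaller sample variance (always >= 1)."""
    return max(s2_a, s2_b) / min(s2_a, s2_b)


# Sample variances of the unseeded and seeded rainfall amounts.
ratio = variance_ratio(77521, 423524)
print(round(ratio, 3))
```

A ratio this far from one is in line with the doubt about equal variances expressed for the rainfall data.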


28.6  Exercises 

28.1 □ The data in Table 28.3 represent salaries (in pounds Sterling) in 72
randomly selected advertisements in The Guardian (April 6, 1992). When
a range was given in the advertisement, the midpoint of the range is
reproduced in the table. The data are salaries corresponding to two kinds of
occupations ($n = m = 72$): (1) creative, media, and marketing and (2) education.
The sample mean and sample variance of the two datasets are, respectively:

(1) $\bar{x}_{72} = 17\,410$ and $s_x^2 = 41\,258\,741$,

(2) $\bar{y}_{72} = 19\,818$ and $s_y^2 = 50\,744\,521$.

Table 28.3. Salaries in two kinds of occupations.


Occupation  (1)  Occupation  (2) 


17703 

13796 

12000 

25899 

17378 

19236 

42000 

22958 

22900 

21676 

15594 

18780 

18780 

10750 

13440 

15053 

17375 

12459 

15723 

13552 

17574 

19461 

20111 

22700 

13179 

21000 

22149 

22485 

16799 

35750 

37500 

18245 

17547 

17378 

12587 

20539 

22955 

19358 

9500 

15053 

24102 

13115 

13000 

22000 

25000 

10998 

12755 

13605 

13500 

12000 

15723 

18360 

35000 

20539 

13000 

16820 

12300 

22533 

20500 

16629 

11000 

17709 

10750 

23008 

13000 

27500 

12500 

23065 

11000 

24260 

18066 

17378 

13000 

18693 

19000 

25899 

35403 

15053 

10500 

14472 

13500 

18021 

17378 

20594 

12285 

12000 

32000 

17970 

14855 

9866 

13000 

20000 

17783 

21074 

21074 

21074 

16000 

18900 

16600 

15053 

19401 

25598 

15000 

14481 

18000 

20739 

15053 

15053 

13944 

35000 

11406 

15053 

15083 

31530 

23960 

18000 

23000 

30800 

10294 

16799 

11389 

30000 

15379 

37000 

11389 

15053 

12587 

12548 

21458 

48000 

11389 

14359 

17000 

17048 

21262 

16000 

26544 

15344 

9000 

13349 

20000 

20147 

14274 

31000 

Source: D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski.
Small data sets. Chapman and Hall, London, 1994; dataset 385. Data
collected by D.J. Hand.


Suppose that the datasets are modeled as realizations of normal distributions
with expectations $\mu_1$ and $\mu_2$, which represent the salaries for occupations (1)
and (2).

a. Test the null hypothesis that the salary for both occupations is the same
at level $\alpha = 0.05$ under the assumption of equal variances. Formulate
the proper null and alternative hypotheses, compute the value of the test
statistic, and report your conclusion.

b. Do the same without the assumption of equal variances.

c. As a comparison, one carries out an empirical bootstrap simulation for the
nonpooled studentized mean difference. The bootstrap approximations for
the critical values are $c_l^* = -2.004$ and $c_u^* = 2.133$. Report your conclusion
about the salaries on the basis of the bootstrap results.

28.2 The data in Table 28.4 represent the duration of pregnancy for 1669
women who gave birth in a maternity hospital in Newcastle-upon-Tyne,
England, in 1954.


Table  28.4.  Durations  of  pregnancy. 


Duration 

Medical 

Emergency 

Social 

11 

1 

15 

1 

17 

1 

20 

1 

22 

1 

2 

24 

1 

3 

25 

2 

1 

26 

1 

27 

2 

2 

1 

28 

1 

2 

1 

29 

3 

1 

30 

3 

5 

1 

31 

4 

5 

2 

32 

10 

9 

2 

33 

6 

6 

2 

34 

12 

7 

10 

35 

23 

11 

4 

36 

26 

13 

19 

37 

54 

16 

30 

38 

68 

35 

72 

39 

159 

38 

115 

40 

197 

32 

155 

41 

111 

27 

128 

42 

55 

25 

64 

43 

29 

8 

16 

44 

4 

5 

3 

45 

3 

1 

6 

46 

1 

1 

1 

47 

1 

56 

1 

Source:  D.J.  Newell.  Statistical  aspects  of  the  demand  for  maternity  beds. 
Journal  of  the  Royal  Statistical  Society,  Series  A,  127:1—33,  1964. 


The durations are measured in complete weeks from the beginning of the last
menstrual period until delivery. The pregnancies are divided into those where
an admission was booked for medical reasons, those booked for social reasons
(such as poor housing), and unbooked emergency admissions. For the three
groups the sample means and sample variances are

Medical: 775 observations with $\bar{x} = 39.08$ and $s^2 = 7.77$,

Emergency: 261 observations with $\bar{x} = 37.59$ and $s^2 = 25.33$,

Social: 633 observations with $\bar{x} = 39.60$ and $s^2 = 4.95$.

Suppose we view the datasets as realizations of random samples from normal
distributions with expectations $\mu_1$, $\mu_2$, and $\mu_3$ and variances $\sigma_1^2$, $\sigma_2^2$, and $\sigma_3^2$,
where $\mu_i$ represents the duration of pregnancy for the women from the $i$th
group. We want to investigate whether the duration differs for the different
groups. For each combination of two groups test the null hypothesis of equality
of $\mu_i$. Compute the values of the test statistic and report your conclusions.

28.3 □ In a seven-day study on the effect of ozone, a group of 23 rats was
kept in an ozone-free environment and a group of 22 rats in an ozone-rich
environment. From each member in both groups the increase in weight (in
grams) was recorded. The results are given in Table 28.5. The interest is in
whether ozone affects the increase of weight. We investigate this by testing
$H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$, where $\mu_1$ and $\mu_2$ denote the increases of
weight for a rat in the ozone-free and ozone-rich groups. The sample means
are

Ozone-free: $\bar{x}_{23} = 22.40$
Ozone-rich: $\bar{y}_{22} = 11.01$.

The pooled standard deviation is $s_p = 4.58$, and the nonpooled standard
deviation is $s_d = 4.64$.


Table 28.5. Weight increase of rats.

Ozone-free
 41.0   38.4   24.4   25.9   21.9   18.3   13.1   27.3   28.5  -16.9
 17.4   21.8   15.4   27.4   19.2   22.4   17.7   26.0   29.4   21.4
 22.7   26.0   26.6

Ozone-rich
 10.1    6.1   20.4    7.3   14.3   15.5   -9.9    6.8   28.2   17.9
-12.9   14.0    6.6   12.1   15.7   39.9  -15.9   54.6  -14.7   44.1
 -9.0   -9.0

Source: K.A. Doksum and G.L. Sievers. Plotting with confidence: graphical
comparisons of two populations. Biometrika, 63(3):421-434, 1976; Table 10
on page 433. By permission of the Biometrika Trustees.


a. Perform the test at level 0.05 under the assumption of normal data with
equal variances, i.e., compute the test statistic and report your conclusion.

b. One also carries out a bootstrap simulation for the test statistic used in
a, and finds critical values $c_l^* = -1.912$ and $c_u^* = 1.959$. What is your
conclusion on the basis of the bootstrap simulation?

c. Also perform the test at level 0.05 without the assumption of equal
variances, where you may use the normal approximation for the distribution
of the test statistic under the null hypothesis.

d. A bootstrap simulation for the test statistic in c yields that the right tail
probability corresponding to the observed value of the test statistic in
this case is 0.014. What is your conclusion on the basis of the bootstrap
simulation?

28.4  Show  that  in  the  case  when  n  =  m,  the  random  variables  Tp  and  Td  are 
the  same. 


28.5 ⊞ Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random
samples from normal distributions with variances $\sigma^2$. It can be shown that

$$\mathrm{Var}(S_X^2) = \frac{2\sigma^4}{n-1} \quad\text{and}\quad \mathrm{Var}(S_Y^2) = \frac{2\sigma^4}{m-1}.$$

Consider linear combinations $aS_X^2 + bS_Y^2$ that are unbiased estimators for $\sigma^2$.

a. Show that $a$ and $b$ must satisfy $a + b = 1$.

b. Show that $\mathrm{Var}(aS_X^2 + (1-a)S_Y^2)$ is minimized for $a = (n-1)/(n+m-2)$
(and hence $b = (m-1)/(n+m-2)$).

28.6 Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples
from distributions with (possibly unequal) variances $\sigma_X^2$ and $\sigma_Y^2$.

a. Show that

$$\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}.$$

b. Show that the pooled variance $S_p^2$, as defined on page 417, is a biased
estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$.

c. Show that the nonpooled variance $S_d^2$, as defined on page 420, is the only
unbiased estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$ of the form $aS_X^2 + bS_Y^2$.

d. Suppose that $\sigma_X^2 = \sigma_Y^2 = \sigma^2$. Show that $S_p^2$, as defined on page 417, is an
unbiased estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \sigma^2(1/n + 1/m)$.

e. Is $S_p^2$ also an unbiased estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$ in the case $\sigma_X^2 \neq \sigma_Y^2$?
What about when $n = m$?


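A


Summary of distributions


Discrete distributions

1. Bernoulli distribution: $Ber(p)$, where $0 < p < 1$.

$P(X = 1) = p$ and $P(X = 0) = 1 - p$.

$E[X] = p$ and $\mathrm{Var}(X) = p(1-p)$.

2. Binomial distribution: $Bin(n, p)$, where $0 < p < 1$.

$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \ldots, n$.

$E[X] = np$ and $\mathrm{Var}(X) = np(1-p)$.

3. Geometric distribution: $Geo(p)$, where $0 < p < 1$.

$P(X = k) = p(1-p)^{k-1}$ for $k = 1, 2, \ldots$.

$E[X] = 1/p$ and $\mathrm{Var}(X) = (1-p)/p^2$.

4. Poisson distribution: $Pois(\mu)$, where $\mu > 0$.

$P(X = k) = \frac{\mu^k}{k!} e^{-\mu}$ for $k = 0, 1, \ldots$.

$E[X] = \mu$ and $\mathrm{Var}(X) = \mu$.


Continuous distributions

1. Cauchy distribution: $Cau(\alpha, \beta)$, where $-\infty < \alpha < \infty$ and $\beta > 0$.

$f(x) = \frac{\beta}{\pi\left(\beta^2 + (x-\alpha)^2\right)}$ for $-\infty < x < \infty$.

$F(x) = \frac{1}{2} + \frac{1}{\pi}\arctan\left(\frac{x-\alpha}{\beta}\right)$ for $-\infty < x < \infty$.

$E[X]$ and $\mathrm{Var}(X)$ do not exist.

2. Exponential distribution: $Exp(\lambda)$, where $\lambda > 0$.

$f(x) = \lambda e^{-\lambda x}$ for $x > 0$.

$F(x) = 1 - e^{-\lambda x}$ for $x > 0$.

$E[X] = 1/\lambda$ and $\mathrm{Var}(X) = 1/\lambda^2$.

3. Gamma distribution: $Gam(\alpha, \lambda)$, where $\alpha > 0$ and $\lambda > 0$.

$f(x) = \frac{\lambda(\lambda x)^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}$ for $x > 0$.

$F(x) = \int_0^x \frac{\lambda(\lambda t)^{\alpha-1} e^{-\lambda t}}{\Gamma(\alpha)}\,dt$ for $x > 0$.

$E[X] = \alpha/\lambda$ and $\mathrm{Var}(X) = \alpha/\lambda^2$.

4. Normal distribution: $N(\mu, \sigma^2)$, where $-\infty < \mu < \infty$ and $\sigma > 0$.

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ for $-\infty < x < \infty$.

$F(x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^2}\,dt$ for $-\infty < x < \infty$.

$E[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$.

5. Pareto distribution: $Par(\alpha)$, where $\alpha > 0$.

$f(x) = \frac{\alpha}{x^{\alpha+1}}$ for $x > 1$.

$F(x) = 1 - x^{-\alpha}$ for $x > 1$.

$E[X] = \alpha/(\alpha-1)$ for $\alpha > 1$ and $\infty$ for $0 < \alpha \le 1$.

$\mathrm{Var}(X) = \alpha/\left((\alpha-1)^2(\alpha-2)\right)$ for $\alpha > 2$ and $\infty$ for $1 < \alpha \le 2$.

6. Uniform distribution: $U(a, b)$, where $a < b$.

$f(x) = \frac{1}{b-a}$ for $a < x < b$.

$F(x) = \frac{x-a}{b-a}$ for $a < x < b$.

$E[X] = (a+b)/2$ and $\mathrm{Var}(X) = (b-a)^2/12$.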

B

Table B.1. Right tail probabilities $1 - \Phi(a) = P(Z \ge a)$ for an $N(0,1)$ distributed
random variable $Z$.

  a      0     1     2     3     4     5     6     7     8     9
 0.0   5000  4960  4920  4880  4840  4801  4761  4721  4681  4641
 0.1   4602  4562  4522  4483  4443  4404  4364  4325  4286  4247
 0.2   4207  4168  4129  4090  4052  4013  3974  3936  3897  3859
 0.3   3821  3783  3745  3707  3669  3632  3594  3557  3520  3483
 0.4   3446  3409  3372  3336  3300  3264  3228  3192  3156  3121
 0.5   3085  3050  3015  2981  2946  2912  2877  2843  2810  2776
 0.6   2743  2709  2676  2643  2611  2578  2546  2514  2483  2451
 0.7   2420  2389  2358  2327  2296  2266  2236  2206  2177  2148
 0.8   2119  2090  2061  2033  2005  1977  1949  1922  1894  1867
 0.9   1841  1814  1788  1762  1736  1711  1685  1660  1635  1611
 1.0   1587  1562  1539  1515  1492  1469  1446  1423  1401  1379
 1.1   1357  1335  1314  1292  1271  1251  1230  1210  1190  1170
 1.2   1151  1131  1112  1093  1075  1056  1038  1020  1003  0985
 1.3   0968  0951  0934  0918  0901  0885  0869  0853  0838  0823
 1.4   0808  0793  0778  0764  0749  0735  0721  0708  0694  0681
 1.5   0668  0655  0643  0630  0618  0606  0594  0582  0571  0559
 1.6   0548  0537  0526  0516  0505  0495  0485  0475  0465  0455
 1.7   0446  0436  0427  0418  0409  0401  0392  0384  0375  0367
 1.8   0359  0351  0344  0336  0329  0322  0314  0307  0301  0294
 1.9   0287  0281  0274  0268  0262  0256  0250  0244  0239  0233
 2.0   0228  0222  0217  0212  0207  0202  0197  0192  0188  0183
 2.1   0179  0174  0170  0166  0162  0158  0154  0150  0146  0143
 2.2   0139  0136  0132  0129  0125  0122  0119  0116  0113  0110
 2.3   0107  0104  0102  0099  0096  0094  0091  0089  0087  0084
 2.4   0082  0080  0078  0075  0073  0071  0069  0068  0066  0064
 2.5   0062  0060  0059  0057  0055  0054  0052  0051  0049  0048
 2.6   0047  0045  0044  0043  0041  0040  0039  0038  0037  0036
 2.7   0035  0034  0033  0032  0031  0030  0029  0028  0027  0026
 2.8   0026  0025  0024  0023  0023  0022  0021  0021  0020  0019
 2.9   0019  0018  0018  0017  0016  0016  0015  0015  0014  0014
 3.0   0013  0013  0013  0012  0012  0011  0011  0011  0010  0010
 3.1   0010  0009  0009  0009  0008  0008  0008  0008  0007  0007
 3.2   0007  0007  0006  0006  0006  0006  0006  0005  0005  0005
 3.3   0005  0005  0005  0004  0004  0004  0004  0004  0004  0003
 3.4   0003  0003  0003  0003  0003  0003  0003  0003  0003  0002

Table B.2. Right critical values $t_{m,p}$ of the t-distribution with $m$ degrees of freedom
corresponding to right tail probability $p$: $P(T_m \ge t_{m,p}) = p$. The last row in the table
contains right critical values of the $N(0,1)$ distribution: $t_{\infty,p} = z_p$.


                          Right tail probability p

 m     0.1     0.05    0.025    0.01     0.005    0.0025    0.001    0.0005

 1    3.078   6.314   12.706   31.821   63.657   127.321   318.309   636.619
 2    1.886   2.920    4.303    6.965    9.925    14.089    22.327    31.599
 3    1.638   2.353    3.182    4.541    5.841     7.453    10.215    12.924
 4    1.533   2.132    2.776    3.747    4.604     5.598     7.173     8.610
 5    1.476   2.015    2.571    3.365    4.032     4.773     5.893     6.869
 6    1.440   1.943    2.447    3.143    3.707     4.317     5.208     5.959
 7    1.415   1.895    2.365    2.998    3.499     4.029     4.785     5.408
 8    1.397   1.860    2.306    2.896    3.355     3.833     4.501     5.041
 9    1.383   1.833    2.262    2.821    3.250     3.690     4.297     4.781
10    1.372   1.812    2.228    2.764    3.169     3.581     4.144     4.587
11    1.363   1.796    2.201    2.718    3.106     3.497     4.025     4.437
12    1.356   1.782    2.179    2.681    3.055     3.428     3.930     4.318
13    1.350   1.771    2.160    2.650    3.012     3.372     3.852     4.221
14    1.345   1.761    2.145    2.624    2.977     3.326     3.787     4.140
15    1.341   1.753    2.131    2.602    2.947     3.286     3.733     4.073
16    1.337   1.746    2.120    2.583    2.921     3.252     3.686     4.015
17    1.333   1.740    2.110    2.567    2.898     3.222     3.646     3.965
18    1.330   1.734    2.101    2.552    2.878     3.197     3.610     3.922
19    1.328   1.729    2.093    2.539    2.861     3.174     3.579     3.883
20    1.325   1.725    2.086    2.528    2.845     3.153     3.552     3.850
21    1.323   1.721    2.080    2.518    2.831     3.135     3.527     3.819
22    1.321   1.717    2.074    2.508    2.819     3.119     3.505     3.792
23    1.319   1.714    2.069    2.500    2.807     3.104     3.485     3.768
24    1.318   1.711    2.064    2.492    2.797     3.091     3.467     3.745
25    1.316   1.708    2.060    2.485    2.787     3.078     3.450     3.725
26    1.315   1.706    2.056    2.479    2.779     3.067     3.435     3.707
27    1.314   1.703    2.052    2.473    2.771     3.057     3.421     3.690
28    1.313   1.701    2.048    2.467    2.763     3.047     3.408     3.674
29    1.311   1.699    2.045    2.462    2.756     3.038     3.396     3.659
30    1.310   1.697    2.042    2.457    2.750     3.030     3.385     3.646
40    1.303   1.684    2.021    2.423    2.704     2.971     3.307     3.551
50    1.299   1.676    2.009    2.403    2.678     2.937     3.261     3.496
 ∞    1.282   1.645    1.960    2.326    2.576     2.807     3.090     3.291
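
The last row of Table B.2 holds the right critical values $z_p$ of the standard normal distribution, so it can be recomputed as a check. The following is a minimal sketch using only Python's standard library; the names `tail_probs` and `z_critical` are illustrative and do not come from the book. The rows for finite $m$ would need a t-distribution quantile function, which the standard library does not provide.

```python
from statistics import NormalDist

# Right tail probabilities heading the columns of Table B.2.
tail_probs = [0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001, 0.0005]

def z_critical(p):
    """Right critical value z_p of N(0, 1), defined by P(Z > z_p) = p."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(1.0 - p)

# Round to three decimals, as in the table.
last_row = [round(z_critical(p), 3) for p in tail_probs]
print(last_row)
```

The printed values should agree with the row marked ∞ in the table.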

C  Answers to selected exercises


2.1  $P(A \cup B) = 13/18$.

2.4  Yes. 

2.7  0.7. 

2.8  $P(D_1 \cup D_2) \le 2 \cdot 10^{-6}$ and
$P(D_1 \cap D_2) \le 10^{-6}$.

2.11  $p = (-1 + \sqrt{5})/2$.

2.12  a  1/10! 

2.12  b  5!  •  5! 


2.12  c  8/63  =  12.7  percent. 


2.14a
          a      b      c
   a      0     1/6    1/6
   b      0      0     1/3
   c      0     1/3     0

2.14 b  $P(\{(a,a), (a,b), (a,c)\}) = 1/3$.

2.14c  $P(\{(b,c), (c,b)\}) = 2/3$.

2.16  $P(E) = 2/3$.

2.19a  $\Omega = \{2, 3, 4, \ldots\}$.

2.19 b  $4p^2(1-p)^3$.

3.1  7/36. 

3.2  a  P(A|  B)  =  2/11. 

3.2  b  No. 


3.3a  $P(S_1) = 13/52 = 1/4$,
$P(S_2 \mid S_1) = 12/51$, and
$P(S_2 \mid S_1^c) = 13/51$.

3.3  b  P(S2)  =  1/4. 


3.4  $P(B \mid T) = 9.1 \cdot 10^{-5}$ and
$P(B \mid T^c) = 4.3 \cdot 10^{-6}$.

3.7a  $P(A \cup B) = 1/2$.

3.7b  P(B)  =  1/3. 

3.8a  $P(W) = 0.117$.

3.8 b  $P(F \mid W) = 0.846$.

3.9  P(B|^)  =  7/15. 

3.14 a  $P(W \mid R) = 0$ and $P(W \mid R^c) = 1$.
3.14b  $P(W) = 2/3$.

3.16  a  P(B|T)  =  0.165. 

3.16  b  0.795. 

4.1a     a          0       1       2
       $p_Z(a)$   25/36   10/36   1/36
Z has a Bin(2, 1/6) distribution.

4.1b  $\{M = 2, Z = 0\} = \{(2,1), (1,2), (2,2)\}$, $\{S = 5, Z = 1\} = \emptyset$, and
$\{S = 8, Z = 1\} = \{(6,2), (2,6)\}$.

$P(M = 2, Z = 0) = 1/12$,
$P(S = 5, Z = 1) = 0$, and
$P(S = 8, Z = 1) = 1/18$.

4.1  c  The  events  are  dependent. 

4.3      a       0     1/2    3/4
       p(a)     1/3    1/6    1/2

4.6 a  $p_X(1) = p_X(3) = 1/27$, $p_X(4/3) = p_X(8/3) = 3/27$, $p_X(5/3) = p_X(7/3) = 6/27$, and $p_X(2) = 7/27$.

4.6 b  6/27.

4.7 a  Bin(1000, 0.001).

4.7 b  $P(X = 0) = 0.3677$, $P(X = 1) = 0.3681$, and $P(X > 2) = 0.0802$.

4.8  a  Bin  (6, 0.8178). 

4.8  b  0.9999634. 

4.10 a  Determine $P(R_i = 0)$ first.

4.10 b  No!

4.10 c  See the birthday problem in Section 3.2.

4.12  No!

4.13a  Geo(1/N).

4.13 b  Let $D_i$ be the event that the marked bolt was drawn (for the first time) in the $i$th draw, and use conditional probabilities in

$P(Y = k) = P(D_1^c \cap \cdots \cap D_{k-1}^c \cap D_k)$.

4.13 c  Count the number of ways the event $\{Z = k\}$ can occur, and divide this by the number of ways $\binom{N}{r}$ we can select $r$ objects from $N$ objects.

5.2  $P(1/2 < X < 3/4) = 5/16$.

5.4 a  $P(X < 4\tfrac{1}{2}) = 1/4$.

5.4b  $P(X = 5) = 1/2$.

5.4 c  X is neither discrete nor continuous!

5.5 a  $c = 1$.

5.5 b  $F(x) = 0$ for $x < -3$;
$F(x) = (x + 3)^2/2$ for $-3 < x < -2$;
$F(x) = 1/2$ for $-2 < x < 2$;
$F(x) = 1 - (3 - x)^2/2$ for $2 < x < 3$;
$F(x) = 1$ for $x > 3$.

5.8a  $g(y) = 1/(2\sqrt{ry})$.

5.8 b  Yes.

5.8 c  Consider $F(r/10)$.

5.9 a  1/2 and $\{(x, y) : 2 < x < 3,\ 1 < y < 3/2\}$.

5.9 b  $F(x) = 0$ for $x < 0$;
$F(x) = 2x$ for $0 < x < 1/2$;
$F(x) = 1$ for $x > 1/2$.

5.9 c  $f(x) = 2$ for $0 < x < 1/2$;
$f(x) = 0$ elsewhere.

5.12  2. 


5.13  a  Change  variables  from  x  to  —x. 

5.13  b  P{Z  <  -2)  =  0.0228. 

6.2 a  $1 + 2\sqrt{0.378\cdots} = 2.2300795$.

6.2  b  Smaller. 

6.2  c  0.3782739. 

6.5  Show, for $a > 0$, that $X < a$ is equivalent with $U > e^{-a}$.

6.6  U  =  e-^^. 

6.7  $Z = \sqrt{-\ln(1 - U)/5}$, or
$Z = \sqrt{-\ln U/5}$.

6.9  a  6/8. 

6.9b  Geo(6/8). 

6.10 a  Define $B_i = 1$ if $U_i < p$ and $B_i = 0$ if $U_i > p$, and $N$ as the position in the sequence of $B_i$ where the first 1 occurs.

6.10 b  $P(Z > n) = (1 - p)^n$, for $n = 0, 1, \ldots$; Z has a Geo(p) distribution.

7.1  a  Outcomes:  1,  2,  3,  4,  5,  and  6.  Each 
has  probability  1/6. 

7.1b  $E[T] = 7/2$, $\mathrm{Var}(T) = 35/12$.

7.2 a  $E[X] = 1/5$.

7.2 b      y          0      1
       P(Y = y)     2/5    3/5
and $E[Y] = 3/5$.

7.2 c  $E[X^2] = 3/5$.

7.2 d  $\mathrm{Var}(X) = 14/25$.

7.5  $E[X] = p$ and $\mathrm{Var}(X) = p(1 - p)$.

7.6  195/76.

7.8  $E[X] = 1/3$.

7.10 a  $E[X] = 1/\lambda$ and $E[X^2] = 2/\lambda^2$.
7.10 b  $\mathrm{Var}(X) = 1/\lambda^2$.

7.11a  2. 

7.11b  The  expectation  is  infinite! 
7.11c  $E[X] = \int_1^\infty x \cdot \alpha x^{-\alpha-1}\,dx$.

7.15 a  Start with
$\mathrm{Var}(rX) = E[(rX - E[rX])^2]$.

7.15 b  Start with $\mathrm{Var}(X + s) = E[((X + s) - E[X + s])^2]$.

7.15 c  Apply b with $rX$ instead of $X$.

7.16  $E[X] = 4/9$.
7.17 a  If positive terms add to zero, they must all be zero.

7.17 b  Note that
$E[(V - E[V])^2] = \mathrm{Var}(V)$.

8.1        y          0     10     20
       P(Y = y)     0.2    0.4    0.4

8.2 a      y         -1      0      1
       P(Y = y)     1/6    1/2    1/3

8.2 b      z         -1      0      1
       P(Z = z)     1/3    1/2    1/6

8.2 c  $P(W = 1) = 1$.

8.3 a  V has a U(7, 9) distribution.

8.3 b  $rU + s$ has a $U(s, s + r)$ distribution if $r > 0$ and a $U(s + r, s)$ distribution if $r < 0$.

8.5 a  $x^2(3 - x)/4$ for $0 < x < 2$.

8.5 b  $F_Y(y) = (3/4)y^4 - (1/4)y^6$ for $0 < y < \sqrt{2}$.

8.5 c  $3y^3 - (3/2)y^5$ for $0 < y < \sqrt{2}$,
0 elsewhere.

8.8  Fwiw)  =  1  —  e  ,  with  7  =  A“. 

8.10  0.1587. 

8.11  Apply  Jensen  with  —g. 

8.12  a  y  0  1  10  100 

ny  =  v)  3  3  i  V 

8.12  b  ^eI^  >  e|^Vx]  . 

8.12  c  =  50.25,  but  E  [Vx]  = 

27.75. 

8.18  V has an exponential distribution with parameter $n\lambda$.

8.19 a  The upper right quarter of the circle.

8.19b  $F_Z(t) = 1/2 + \arctan(t)/\pi$.
8.19c  $1/[\pi(1 + z^2)]$.

9.2 a  $P(X = 0, Y = -1) = 1/6$,
$P(X = 0, Y = 1) = 0$,
$P(X = 1, Y = -1) = 1/6$,
$P(X = 2, Y = -1) = 1/6$,
and $P(X = 2, Y = 1) = 0$.


9.2  b  Dependent. 


9.5  a 

1/16 

VI 

VI 

1/4. 

9.5  b 

No. 

9.6 a
                 u
   v       0      1      2
   0      1/4     0     1/4     1/2
   1       0     1/2     0      1/2
          1/4    1/2    1/4      1

9.6 b  Dependent.

9.8  a 

z 

0  1 

2  : 

3 

Pz{z) 

9.8  b  z 


i  i  i  i 

4  4  4  4 


-2-10123 


1  1  i  i  1 
8  4  4  8  8 

^-2x 


Pxi^) 

9.9a  $F_X(x) = 1 - e^{-2x}$ for $x > 0$ and $F_Y(y) = 1 - e^{-y}$ for $y > 0$.

9.9 b  $f(x, y) = 2e^{-(2x+y)}$ for $x > 0$ and $y > 0$.

9.9 c  $f_X(x) = 2e^{-2x}$ for $x > 0$ and $f_Y(y) = e^{-y}$ for $y > 0$.

9.9  d  Independent. 

9.10  a  41/720. 

9.10  b  F{a,b)  =  -b  la^b^. 

9.10c  $F_X(a) = a^2$.

9.10 d  $f_X(x) = 2x$ for $0 < x < 1$.

9.10 e  Independent.

9.11  27/50.

9.13a  $1/\pi$.

9.13 b  $F_R(r) = r^2$ for $0 < r < 1$.

9.13c  $f_X(x) = \frac{2}{\pi}\sqrt{1 - x^2} = f_Y(x)$ for $x$ between $-1$ and 1.

9.15  a  Since  F(a,h)  =  (^nD(a,b)) 

where  □(«,  6)  is  the  set  of  points  (x,y), 
for  which  x  <  a  and  y  <  b,  one  needs  to 
calculate  the  areas  for  the  various  cases. 

9.15  b  f{x,y)  =  2  for  (x,y)  G  A,  and 
/(x,  y)  =  0  otherwise. 

9.15  c  Use  the  rule  on  page  122. 

9.19 a  $a = 5\sqrt{2}$, $b = 4\sqrt{2}$, and $c = 18$.
9.19 b  Use that $\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2}$ is the probability density function of an $N(\mu, \sigma^2)$ distributed random variable.
9.19c  $N(0, 1/36)$.

10.1a  $\mathrm{Cov}(X, Y) = 0.142$. Positively correlated.

10.1b  $\rho(X, Y) = 0.0503$.

10.2 a  $E[XY] = 0$.

10.2 b  $\mathrm{Cov}(X, Y) = 0$.

10.2 c  $\mathrm{Var}(X + Y) = 4/3$.

10.2 d  $\mathrm{Var}(X - Y) = 4/3$.

10.5 a
                  a
   b       0        1        2
   0      8/72     6/72    10/72     1/3
   1     12/72     9/72    15/72     1/2
   2      4/72     3/72     5/72     1/6
          1/3      1/4      5/12      1

10.5 b  $E[X] = 13/12$, $E[Y] = 5/6$, and $\mathrm{Cov}(X, Y) = 0$.

10.5  c  Yes. 

10.6 a  $E[X] = E[Y] = 0$ and $\mathrm{Cov}(X, Y) = 0$.

10.6b  $E[X] = E[Y] = c$; $E[XY] = c^2$.

10.6 c  No.

10.7 a  $\mathrm{Cov}(X, Y) = -1/8$.

10.7b  $\rho(X, Y) = -1/2$.

10.7 c  For $\varepsilon$ equal to 1/4, 0 or $-1/4$.

10.9 a  $P(X_i = 1) = (1 - 0.001)^{40} = 0.96$ and $P(X_i = 41) = 0.04$.

10.9 b  $E[X_i] = 2.6$ and $E[X_1 + \cdots + X_{25}] = 65$.

10.10 a  $E[X] = 109/50$, $E[Y] = 157/100$, and $E[X + Y] = 15/4$.

10.10 b  $E[X^2] = 1287/250$, $E[Y^2] = 318/125$, and $E[(X + Y)^2] = 3633/250$.

10.10 c  $\mathrm{Var}(X) = 989/2500$, $\mathrm{Var}(Y) = 791/10\,000$, and $\mathrm{Var}(X + Y) = 4747/10\,000$.


10.14  a  Use  the  alternative  expression 
for  the  covariance. 

10.14  b  Use  the  alternative  expression 
for  the  covariance. 


10.14  c  Combine  parts  a  and  b. 
10.16a  $\mathrm{Var}(X) + \mathrm{Cov}(X, Y)$.

10.16 b  Anything can happen.

10.16 c  $X$ and $X + Y$ are positively correlated.


10.18  Solve $0 = N(N - 1)(N + 1)/12 + N(N - 1)\,\mathrm{Cov}(X_1, X_2)$.

11.1  a  Check  that  for  k  between  2  and  6, 

the  summation  runs  over  I  =  l,...,fc  —  1, 
whereas  for  k  between  7  and  12  it  runs 
over  i  =  k  —  ,12. 

11.1b  Check that for $2 < k < N$, the summation runs over $\ell = 1, \ldots, k - 1$, whereas for $k$ between $N + 1$ and $2N$ it runs over $\ell = k - N, \ldots, 2N$.

11.2a  Check that the summation runs over $\ell = 0, 1, \ldots, k$.

11.2  b  Use  that  /{X+p)'‘  is  equal 

top^(l— p)*’  with  p  =  p/(A  + /r). 
11.4a  $E[Z] = -3$ and $\mathrm{Var}(Z) = 81$.

11.4 b  Z has an $N(-3, 81)$ distribution.

11.4 c  $P(Z < 6) = 0.8413$.

11.5  Check that for $0 < z < 1$, the integral runs over $0 < y < z$, whereas for $1 < z < 2$, it runs over $z - 1 < y < 1$.


11.6  Check that the integral runs over $0 < y < z$.


11.7  Recall that a $Gam(k, \lambda)$ random variable can be represented as the sum of $k$ independent $Exp(\lambda)$ random variables.

11.9a  fz{z)  =  lor  z>  1. 


11.9b  fz{z) 
for  2:  >  1. 


fJ: _ 

(3  —  a 


12.1 e  1: no, 2: no, 3: okay, 4: okay, 5: okay.

12.5 a  0.00049.
12.5b  1 (correct to 8281 decimals).

12.6  0.256.

12.7a  $\lambda \approx 0.192$.

12.7 b  0.1583 is close to 0.147.

12.7c  2.71  •  10“®. 

12.8a  $E[X(X - 1)] = \mu^2$.

12.8 b  $\mathrm{Var}(X) = \mu$.

12.11  The  probability  of  the  event  in 
the  hint  equals  (As)"e“'^^'*/(A:!(n  —  k)\). 

12.14 a  Note: $1 - 1/n \to 1$ and $1/n \to 0$.

12.14b  $E[X_n] = (1 - 1/n) \cdot 0 + (1/n) \cdot 7n = 7$.

13.2 a  $E[X_i] = 0$ and $\mathrm{Var}(X_i) = 1/12$.

13.2 b  1/12.

13.4a  n  >  63. 

13.4  b  n  >  250. 

13.4  c  n  >  125. 

13.4  d  n  >  240. 

13.6  Expected  income  per  game  €  1/37; 
per  year:  €  9865. 

13.8a  Var(Yn/2/i)  =  0.171/h^/n. 

13.8  b  n  >  801. 

13.9a  $T_n$ is the average of a sequence of independent and identically distributed random variables.

13.9b  $a = E[X_i^2] = 1/3$.

13.10 a  $P(|M_n - 1| > \varepsilon) = (1 - \varepsilon)^n$ for $0 < \varepsilon < 1$.

13.10  b  No. 

14.2  0.9977. 

14.3  17. 

14.4  1/2. 

14.5  Use that $X$ has the same probability distribution as $X_1 + X_2 + \cdots + X_n$, where $X_1, X_2, \ldots, X_n$ are independent Ber(p) distributed random variables.

14.6 a  $P(X < 25) \approx 0.5$, $P(X < 26) \approx 0.6141$.

14.6 b  $P(X < 2) \approx 0$.

14.9  a  5.71%. 

14.9  b  Yes! 


14.10  a  91. 

14.10 b  Use that $(M_n - c)/\sigma$ has an $N(0, 1)$ distribution.

15.3 a
   Bin             Height
   (0,250]         0.00297
   (250,500]       0.00067
   (500,750]       0.00015
   (750,1000]      0.00008
   (1000,1250]     0.00002
   (1250,1500]     0.00004
   (1500,1750]     0.00004
   (1750,2000]     0
   (2000,2250]     0
   (2250,2500]     0.00002

15.3 b  Skewed.


15.4 a
   Bin             Height
   (0,500]         0.0012741
   (500,1000]      0.0003556
   (1000,1500]     0.0001778
   (1500,2000]     0.0000741
   (2000,2500]     0.0000148
   (2500,3000]     0.0000148
   (3000,3500]     0.0000296
   (3500,4000]     0
   (4000,4500]     0.0000148
   (4500,5000]     0
   (5000,5500]     0.0000148
   (5500,6000]     0.0000148
   (6000,6500]     0.0000148

15.4 b
     t     $F_n(t)$        t     $F_n(t)$
     0     0             3500    0.9704
   500     0.6370        4000    0.9704
  1000     0.8148        4500    0.9778
  1500     0.9037        5000    0.9778
  2000     0.9407        5500    0.9852
  2500     0.9481        6000    0.9926
  3000     0.9556        6500    1

15.4 c  Both are equal to 0.0889.

15.5
   Bin         Height
   (0,1]       0.2250
   (1,3]       0.1100
   (3,5]       0.0850
   (5,8]       0.0400
   (8,11]      0.0230
   (11,14]     0.0350
   (14,18]     0.0225

15.6  i4(7)  =  0.9. 

15.11  Use that the number of $x_i$ in $(a, b]$ equals the number of $x_i \le b$ minus the number of $x_i \le a$.

15.12 a  Bring the integral into the sum, change the integration variable to $u = (t - x_i)/h$, and use the properties of kernel functions.

15.12 b  Similar to a.

16.1a  Median: 290.

16.1b  Lower quartile: 81; upper quartile: 843; IQR: 762.

16.1c  144.6. 

16.3  a  Median:  70;  lower  quartile:  66.25; 
upper  quartile:  75. 


16.3  c  Note  the  position  of  31  in  the 
boxplot. 

16.4  a  Yes,  they  both  equal  7.056. 

16.4  b  Yes. 

16.4  c  Yes. 

16.6  a  Yes. 

16.6  b  In  general  this  will  not  be  true. 
16.6  c  Yes. 

16.8  MAD  is  3. 

16.10 a  The sample mean goes to infinity, whereas the sample median changes to 4.6.

16.10 b  At least three elements need to be replaced.

16.10 c  For the sample mean only one; for the sample median at least $\lfloor (n + 1)/2 \rfloor$ elements.

16.12  $\bar{x}_n = (N + 1)/2$; $\mathrm{Med}_n = (N + 1)/2$.


Write  {x 

i-Xr,f  = 

xi  —  2XnXi 

17.1

   N(3,1)      N(0,1)      N(0,1)
   N(3,1)      Exp(1/3)    Exp(1)
   N(0,1)      N(0,9)      Exp(1)
   N(3,1)      N(0,9)      Exp(1/3)
   N(0,9)      Exp(1/3)    Exp(1)


16.3 b
17.2

   Exp(1/3)    N(0,9)      Exp(1/3)
   N(0,1)      N(3,1)      Exp(1)
   N(0,9)      N(0,9)      N(3,1)
   Exp(1)      N(3,1)      Exp(1)
   N(0,1)      N(0,1)      Exp(1/3)

17.3a  Bin(10, p).

17.3  b  p  =  0.435. 

17.5 a  One possibility is $p = 93/331$; another is $p = 29/93$.

17.5 b  $p = 474/1285$ or $p = 198/474$.

17.5 c  0.6281 or 0.6741 for smokers and 0.7486 or 0.8026 for nonsmokers.

17.7 a  An exponential distribution.

17.7 b  One possibility is $\lambda = 0.00469$.
17.9a  Recall the formula for the volume of a cylinder with diameter $d$ (at the base) and height $h$.

17.9 b  $\bar{z}_n = 0.3022$; $\bar{y}/\bar{x} = 0.3028$; least squares: 0.3035.

18.1  $5^6 = 15625$. Not equally likely.

18.3  a  0.0574. 

18.3  b  0.0547. 

18.3  c  0.000029. 

18.4  a  0.3487. 

18.4b  $(1 - 1/n)^n$.

18.5  values  0,  ±1,  ±2,  and  ±3  with 
probabilities  7/27,  6/27,  3/27,  and  1/27. 

18.7  Determine from which parametric distribution you generate the bootstrap datasets and what the bootstrapped version is of $\bar{X}_n - \mu$.

18.8a  Determine from which $F$ you generate the bootstrap datasets and what the bootstrapped version is of $\bar{X}_n - \mu$.

18.8b  Similar to a.

18.8 c  Similar to a and b.

18.9  Determine which normal distribution corresponds to $X_1^*, X_2^*, \ldots, X_n^*$ and use this to compute $P(|\bar{X}_n^* - \mu^*| > 1)$.


19.1a  First show that $E[X_i^2] = \theta^2/3$, and use linearity of expectations.

19.1 b  $\sqrt{T}$ has negative bias.

19.3  $a = 1/n$, $b = 0$.

19.5  c  =  n. 

19.6  a  Use  linearity  of  expectations  and 
plug  in  the  expressions  for  E  [M„]  and 
E[X„]. 

19.6  b  {nMn  —  Xn) / {n  —  1) . 

19.6  c  Estimate  for  6:  2073.5. 

19.8  Check that $E[Y_i] = \beta x_i$ and use linearity of expectations.

20.2  a  We  prefer  T. 

20.2  b  If  a  <  6  we  prefer  T;  if  a  >  6  we 
prefer  S. 

20.3  Ti. 

20.4a  E[3L-l]  =  3E[X-bl-M]-l  =  X. 
20.4b  (X  +  l)(X-2)/2. 

20.4  c  4  times. 

20.7  $\mathrm{Var}(T_1) = (4 - \theta^2)/n$ and $\mathrm{Var}(T_2) = \theta(4 - \theta)/n$. We prefer $T_2$.

20.8  a  Use  linearity  of  expectations. 

20.8  b  Differentiate  with  respect  to  r. 
20.11  MSE(Ti)  =  <7V(Er=i4), 

MSE(r2)  =  (nVn^).Er.i(l/4), 

MSE(r3)  =  a2n/(EEi:rO^ 

21.1  D2. 

21.2  p  =  1/4. 

21.4 a  Use that $X_1, \ldots, X_n$ are independent $Pois(\mu)$ distributed random variables.

21.4 b  $\ell(\mu) = \left(\sum_{i=1}^n x_i\right) \ln(\mu) - \ln(x_1! \cdot x_2! \cdots x_n!) - n\mu$, $\hat{\mu} = \bar{x}_n$.

21.4c  $e^{-\bar{x}_n}$.


21.5a  Xn  • 
21.5b 


2I.T  x/^ELr*?- 
21.8 a  $L(\theta) = \frac{1}{4^{3839}} \cdot (2 + \theta)^{1997} \cdot \theta^{32} \cdot (1 - \theta)^{1810}$; $\ell(\theta) = -3839 \ln(4) + 1997 \ln(2 + \theta) + 32 \ln(\theta) + 1810 \ln(1 - \theta)$.

21.8 b  0.0357.

21.8c  $(-b + \sqrt{D})/(2n)$, with $b = -n_1 + n_2 + 2n_3 + 2n_4$, and $D = (n_1 - n_2 - 2n_3 - 2n_4)^2 + 8nn_2$.

21.9  $\hat{\alpha} = x_{(1)}$ and $\hat{\beta} = x_{(n)}$.

21.11a  $1/\bar{x}_n$.

21.11b  $y_{(n)}$.

22.1a  $\hat{\alpha} = 2.35$, $\hat{\beta} = -0.25$.

22.1b  $r_1 = -0.1$, $r_2 = 0.2$, $r_3 = -0.1$.

22.1c  The estimated regression line goes through (0, 2.35) and (3, 1.6).

22.5  Minimize $\sum_{i=1}^n (y_i - \beta x_i)^2$.

22.6  2218.45. 

22.8  The  model  with  no  intercept. 

22.10 a  $\hat{\alpha} = 7/3$, $\hat{\beta} = -1$, $A(\hat{\alpha}, \hat{\beta}) = 4/3$.

22.10 b  $17/9 < \alpha < 7/3$, $\alpha = 2$.

22.10c  $\alpha = 2$, $\beta = -1$.

22.12 a  Use that the denominator of $\hat{\beta}$ and that $\sum x_i$ are numbers, not random variables.

22.12 b  Use that $E[Y_i] = \alpha + \beta x_i$.

22.12  c  Simplify  the  expression  in  b. 
22.12  d  Combine  a  and  c. 

23.1  (740.55,745.45). 

23.2  (3.486,3.594). 

23.5  a  (0.050,1.590). 

23.5  b  See  Section  23.3. 

23.5  c  (0.045,  1.600). 

23.6 a  Rewrite the probability in terms of $L_n$ and $U_n$.

23.6 b  $(3l_n + 7,\ 3u_n + 7)$.

23.6c  $\tilde{L}_n = 1 - U_n$ and $\tilde{U}_n = 1 - L_n$.
The confidence interval: $(-4, 3)$.

23.6 d  (0, 25) is a conservative 95% confidence interval for $\theta$.

23.7  $(e^{-3}, e^{-2}) = (0.050, 0.135)$.
23.11a  Yes. 


23.11b  Not  necessarily. 

23.11c  Not  necessarily. 

24.1  (0.620,0.769). 

24.4  a  609. 

24.4  b  No. 

24.6a  $(1.68, \infty)$.

24.6b  [0,2.80). 

24.8a  (0.449,0.812). 

24.8b  (0.481,1]. 

24.9  a  See  Section  8.4. 

24.9 b  $c_l = 0.779$, $c_u = 0.996$.

24.9c  (3.013, 3.851).

24.9d  $(m/(1 - \alpha/2)^{1/n},\ m/(\alpha/2)^{1/n})$.

25.2  $H_1: \mu > 1472$.

25.4 a  The difference or the ratio of the average numbers of cycles for the two groups.

25.4 b  The difference or the ratio of the maximum likelihood estimators $\hat{p}_1$ and $\hat{p}_2$.

25.4c  $H_1: p_1 < p_2$.

25.5 a  Relevant values of $T_1$ are in [0, 5]; those close to 0, or close to 5, are in favor of $H_1$.

25.5 b  Relevant values of $T_2$ are in [0, 5]; only those close to 0 are in favor of $H_1$.

25.6 a  The p-value is 0.23. Do not reject.

25.6 b  The p-value is 0.77. Do not reject.

25.6 c  The p-value is 0.968. Do not reject.

25.6 d  The p-value is 0.019. Reject.

25.6 e  The p-value is 0.99. Do not reject.

25.6 f  The p-value is smaller than 0.019. Reject.

25.6 g  The p-value is smaller than 0.200. We cannot say anything about rejection of $H_0$.

25.10 a  $H_1: \mu > 23.75$.

25.10 b  The p-value is 0.0344.

25.11  0.0456.


26.3a  0.1. 

26.3  b  0.72. 

26.5 a  The p-value is 0.1050. Do not reject $H_0$; this agrees with Exercise 24.8 b.

26.5 b  $K = \{16, 17, \ldots, 23\}$.

26.5 c  0.0466.

26.5 d  0.6950.

26.6 a  Right critical value.

26.6 b  Right critical value $c = 1535.1$; critical region $[1536, \infty)$.

26.8  a  For  T  we  find  K  =  (0,  C(]  and  for 
T'  we  find  K'  =  [cu,  1). 

26.8  b  For  T  we  find  K  =  (0,  ci]  U 
[cu,  oo)  and  for  T'  we  find  K'  —  (0,  c[]  U 

[4,1). 

26.9a  For  T  we  find  K  =  [c„,oo)  and 
for  T'  we  find  K'  =  [c[,  0)  U  (0,  c'J^. 

26.9b  For  T  we  find  K  =  [cu,oo)  and 
for  T'  we  find  K'  =  (0,  . 

27.2 a  $H_0: \mu = 2550$ and $H_1: \mu \ne 2550$.

27.2 b  $t = 1.2096$. Do not reject $H_0$.
27.5 a  $H_0: \mu = 0$; $H_1: \mu > 0$; $t = 0.70$.

27.5 b  p-value: 0.2420. Do not reject $H_0$.

27.7 a  $H_0: \beta = 0$ and $H_1: \beta < 0$; $t_b = -20.06$. Reject $H_0$.

27.7 b  Same testing problem; $t_b = -11.03$. Reject $H_0$.

28.1a  $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \ne \mu_2$; $t_p = -2.130$. Reject $H_0$.

28.1b  $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \ne \mu_2$; $t_d = -2.130$. Reject $H_0$.
28.1 c  Reject $H_0$. The salaries differ significantly.

28.3a  $t_p = 2.492$. Reject $H_0$.
28.3 b  Reject $H_0$.
28.3c  $t_d = 2.463$. Reject $H_0$.
28.3d  Reject $H_0$.

28.5  a  Determine  E  \_aS\  -I-  bSy] ,  using 
that  Sx  and  Sy  are  both  unbiased  for  cr^. 

28.5  b  Determine  E  \aSx  +  (1  ~  fl)>S'v] , 
using  that  Sx  and  Sy  are  independent, 
and  minimize  over  a. 
2.8  From the rule for the probability of a union we obtain $P(D_1 \cup D_2) \le P(D_1) + P(D_2) = 2 \cdot 10^{-6}$. Since $D_1 \cap D_2$ is contained in both $D_1$ and $D_2$, we obtain $P(D_1 \cap D_2) \le \min\{P(D_1), P(D_2)\} = 10^{-6}$. Equality may hold in both cases: for the union, take $D_1$ and $D_2$ disjoint; for the intersection, take $D_1$ and $D_2$ equal to each other.

2.12  a  This  is  the  same  situation  as  with  the  three  envelopes  on  the  doormat,  but 
now  with  ten  possibilities.  Hence  an  outcome  has  probability  1/10!  to  occur. 

2.12  b  For  the  five  envelopes  labeled  1,2,3,  4,  5  there  are  5!  possible  orders,  and 
for  each  of  these  there  are  5!  possible  orders  for  the  envelopes  labeled  6,  7,  8,  9, 10. 
Hence  in  total  there  are  5!  •  5!  outcomes. 

2.12 c  There are $32 \cdot 5! \cdot 5!$ outcomes in the event "dream draw." Hence the probability is $32 \cdot 5! \cdot 5!/10! = 32 \cdot 1 \cdot 2 \cdot 3 \cdot 4 \cdot 5/(6 \cdot 7 \cdot 8 \cdot 9 \cdot 10) = 8/63 = 12.7$ percent.
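
The arithmetic in solution 2.12 c can be checked with exact fractions. This is a small sketch; the variable names are illustrative only.

```python
from fractions import Fraction
from math import factorial

# Outcomes in the event "dream draw" (count taken from the solution text).
favourable = 32 * factorial(5) * factorial(5)
# All orderings of the ten envelopes.
total = factorial(10)

p_dream = Fraction(favourable, total)
print(p_dream, round(float(p_dream) * 100, 1))
```

The fraction reduces to 8/63, which is about 12.7 percent.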

2.14 a  Since door a is never opened, $P((a,a)) = P((b,a)) = P((c,a)) = 0$. If the candidate chooses a (which happens with probability 1/3), then the quizmaster chooses without preference from doors b and c. This yields that $P((a,b)) = P((a,c)) = 1/6$. If the candidate chooses b (which happens with probability 1/3), then the quizmaster can only open door c. Hence $P((b,c)) = 1/3$. Similarly, $P((c,b)) = 1/3$. Clearly, $P((b,b)) = P((c,c)) = 0$.

2.14 b  If the candidate chooses a then she or he wins; hence the corresponding event is $\{(a,a), (a,b), (a,c)\}$, and its probability is 1/3.

2.14 c  To end with a the candidate should have chosen b or c. So the event is $\{(b,c), (c,b)\}$ and $P(\{(b,c), (c,b)\}) = 2/3$.
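
The probabilities in solution 2.14 can be reproduced by enumerating the outcomes (door chosen, door opened). The sketch below assumes, as the solution does, that door a is never opened, that the candidate picks a door uniformly at random, and that the quizmaster picks uniformly among the doors he is allowed to open; the variable names are illustrative.

```python
from fractions import Fraction

doors = ["a", "b", "c"]
never_opened = "a"  # assumption taken from the solution: door a is never opened

prob = {}
for choice in doors:
    # The quizmaster may open any door except a and the candidate's door.
    openable = [d for d in doors if d != never_opened and d != choice]
    for opened in openable:
        prob[(choice, opened)] = Fraction(1, 3) * Fraction(1, len(openable))

# Part b: the candidate starts with door a.
p_start_at_a = sum(p for (choice, _), p in prob.items() if choice == "a")
# Part c: the candidate started with b or c, so ends with a.
p_end_at_a = sum(p for (choice, _), p in prob.items() if choice != "a")
print(prob, p_start_at_a, p_end_at_a)
```

Outcomes that do not appear in `prob` have probability 0, in line with the table in the answer to 2.14 a.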

2.16  Since $E \cap F \cap G = \emptyset$, the three sets $E \cap F$, $F \cap G$, and $E \cap G$ are disjoint. Since each has probability 1/3, they have probability 1 together. From these two facts one deduces $P(E) = P(E \cap F) + P(E \cap G) = 2/3$ (make a diagram or use that $E = E \cap (E \cap F) \cup E \cap (F \cap G) \cup E \cap (E \cap G)$).

3.1  Define the following events: B is the event "point B is reached on the second step," C is the event "the path to C is chosen on the first step," and similarly we define D and E. Note that the events C, D, and E are mutually exclusive and that one of them must occur. Furthermore, note that we can only reach B by first going to C or D. For the computation we use the law of total probability, by conditioning on the result of the first step:

$$P(B) = P(B \cap C) + P(B \cap D) + P(B \cap E)$$
$$= P(B \mid C)P(C) + P(B \mid D)P(D) + P(B \mid E)P(E)$$
$$= \frac{1}{3} \cdot \frac{1}{3} + \frac{1}{4} \cdot \frac{1}{3} + 0 \cdot \frac{1}{3} = \frac{7}{36}.$$

3.2 a  Event A has three outcomes, event B has 11 outcomes, and $A \cap B = \{(1,3), (3,1)\}$. Hence we find $P(B) = 11/36$ and $P(A \cap B) = 2/36$ so that

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{2/36}{11/36} = \frac{2}{11}.$$

3.2 b  Because $P(A) = 3/36 = 1/12$ and this is not equal to $2/11 = P(A \mid B)$ the events A and B are dependent.

3.3 a  There are 13 spades in the deck and each has probability 1/52 of being chosen, hence $P(S_1) = 13/52 = 1/4$. Given that the first card is a spade there are $13 - 1 = 12$ spades left in the deck with $52 - 1 = 51$ remaining cards, so $P(S_2 \mid S_1) = 12/51$. If the first card is not a spade there are 13 spades left in the deck of 51, so $P(S_2 \mid S_1^c) = 13/51$.

3.3 b  We use the law of total probability (based on $\Omega = S_1 \cup S_1^c$):

$$P(S_2) = P(S_2 \cap S_1) + P(S_2 \cap S_1^c) = P(S_2 \mid S_1)P(S_1) + P(S_2 \mid S_1^c)P(S_1^c)$$
$$= \frac{12}{51} \cdot \frac{1}{4} + \frac{13}{51} \cdot \frac{3}{4} = \frac{12 + 39}{51 \cdot 4} = \frac{1}{4}.$$
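
The law of total probability used in solution 3.3 b is easy to verify with exact arithmetic. This is a minimal sketch with illustrative variable names.

```python
from fractions import Fraction

p_s1 = Fraction(13, 52)               # first card is a spade
p_s2_given_s1 = Fraction(12, 51)      # second is a spade, given the first was
p_s2_given_not_s1 = Fraction(13, 51)  # second is a spade, given the first was not

# Law of total probability, conditioning on the first card.
p_s2 = p_s2_given_s1 * p_s1 + p_s2_given_not_s1 * (1 - p_s1)
print(p_s2)
```

The result equals $P(S_1)$, as one would expect by symmetry between the first and second card.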

3.7 a  The best approach to a problem like this one is to write out the conditional probability and then see if we can somehow combine this with $P(A) = 1/3$ to solve the puzzle. Note that $P(B \cap A^c) = P(B \mid A^c)P(A^c)$ and that $P(A \cup B) = P(A) + P(B \cap A^c)$. So

$$P(A \cup B) = \frac{1}{3} + \frac{1}{4}\left(1 - \frac{1}{3}\right) = \frac{1}{2}.$$

3.7 b  From the conditional probability we find $P(A^c \cap B^c) = P(A^c \mid B^c)P(B^c) = \frac{1}{2}(1 - P(B))$. Recalling DeMorgan's law we know $P(A^c \cap B^c) = P((A \cup B)^c) = 1 - P(A \cup B) = 1/3$. Combined this yields an equation for $P(B)$: $\frac{1}{2}(1 - P(B)) = 1/3$, from which we find $P(B) = 1/3$.

3.8a  This asks for $P(W)$. We use the law of total probability, decomposing $\Omega = F \cup F^c$. Note that $P(W \mid F) = 0.99$.

$$P(W) = P(W \cap F) + P(W \cap F^c) = P(W \mid F)P(F) + P(W \mid F^c)P(F^c)$$
$$= 0.99 \cdot 0.1 + 0.02 \cdot 0.9 = 0.099 + 0.018 = 0.117.$$

3.8 b  We need to determine $P(F \mid W)$, and this can be done using Bayes' rule. Some of the necessary computations have already been done in a; we can copy $P(W \cap F)$ and $P(W)$ and get:

$$P(F \mid W) = \frac{P(F \cap W)}{P(W)} = \frac{0.099}{0.117} = 0.846.$$
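
The two steps of solution 3.8 (total probability, then Bayes' rule) can be sketched in a few lines. The numbers 0.1, 0.99, and 0.02 are the ones appearing in the computation above; the variable names are illustrative, with `event` standing for the conditioning event of the exercise.

```python
p_event = 0.1                # prior probability of the conditioning event
p_w_given_event = 0.99       # P(W | event)
p_w_given_not_event = 0.02   # P(W | complement of the event)

# Part a: law of total probability.
p_w = p_w_given_event * p_event + p_w_given_not_event * (1 - p_event)

# Part b: Bayes' rule.
p_event_given_w = p_w_given_event * p_event / p_w

print(round(p_w, 3), round(p_event_given_w, 3))
```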
4.1a  In two independent throws of a die there are 36 possible outcomes, each occurring with probability 1/36. Since there are 25 ways to have no 6's, 10 ways to have one 6, and one way to have two 6's, we find that $p_Z(0) = 25/36$, $p_Z(1) = 10/36$, and $p_Z(2) = 1/36$. So the probability mass function $p_Z$ of Z is given by the following table:

      z           0       1       2
   $p_Z(z)$    25/36   10/36    1/36

The distribution function $F_Z$ is given by

$$F_Z(a) = \begin{cases} 0 & \text{for } a < 0 \\ \frac{25}{36} & \text{for } 0 \le a < 1 \\ \frac{25}{36} + \frac{10}{36} = \frac{35}{36} & \text{for } 1 \le a < 2 \\ \frac{25}{36} + \frac{10}{36} + \frac{1}{36} = 1 & \text{for } a \ge 2. \end{cases}$$

Z is the sum of two independent Ber(1/6) distributed random variables, so Z has a Bin(2, 1/6) distribution.
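
The probability mass function in solution 4.1 a can be confirmed by listing all 36 outcomes and comparing with the Bin(2, 1/6) formula. The sketch below uses illustrative names only.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Count the number of sixes in each of the 36 equally likely outcomes.
pmf = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
for first, second in product(range(1, 7), repeat=2):
    sixes = (first == 6) + (second == 6)
    pmf[sixes] += Fraction(1, 36)

# Binomial probabilities with n = 2 and p = 1/6.
binomial = {
    k: comb(2, k) * Fraction(1, 6) ** k * Fraction(5, 6) ** (2 - k)
    for k in range(3)
}
print(pmf == binomial, pmf)
```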

4.1b  If we denote the outcome of the two throws by $(i, j)$, where $i$ is the outcome of the first throw and $j$ the outcome of the second, then $\{M = 2, Z = 0\} = \{(2,1), (1,2), (2,2)\}$, $\{S = 5, Z = 1\} = \emptyset$, $\{S = 8, Z = 1\} = \{(6,2), (2,6)\}$. Furthermore, $P(M = 2, Z = 0) = 3/36$, $P(S = 5, Z = 1) = 0$, and $P(S = 8, Z = 1) = 2/36$.

4.1 c  The events are dependent, because, e.g., $P(M = 2, Z = 0) = \frac{3}{36}$ differs from $P(M = 2) \cdot P(Z = 0) = \frac{3}{36} \cdot \frac{25}{36}$.

4.10 a  Each $R_i$ has a Bernoulli distribution, because it can only attain the values 0 and 1. The parameter is $p = P(R_i = 1)$. It is not easy to determine $P(R_i = 1)$, but it is fairly easy to determine $P(R_i = 0)$. The event $\{R_i = 0\}$ occurs when none of the $m$ people has chosen the $i$th floor. Since they make their choices independently of each other, and each floor is selected by each of these $m$ people with probability 1/21, it follows that

$$P(R_i = 0) = \left(\frac{20}{21}\right)^m.$$

Now use that $p = P(R_i = 1) = 1 - P(R_i = 0)$ to find the desired answer.

4.10 b  If $\{R_1 = 0\}, \ldots, \{R_{20} = 0\}$, we must have that $\{R_{21} = 1\}$, so we cannot conclude that the events $\{R_1 = a_1\}, \ldots, \{R_{21} = a_{21}\}$, where $a_i$ is 0 or 1, are independent. Consequently, we cannot use the argument from Section 4.3 to conclude that $S_m$ is Bin(21, p). In fact, $S_m$ is not Bin(21, p) distributed, as the following shows. The elevator will stop at least once, so $P(S_m = 0) = 0$. However, if $S_m$ would have a Bin(21, p) distribution, then $P(S_m = 0) = (1 - p)^{21} > 0$, which is a contradiction.

4.10 c  This exercise is a variation on finding the probability of no coincident birthdays from Section 3.2. For $m = 2$, $S_2 = 1$ occurs precisely if the two persons entering the elevator select the same floor. The first person selects any of the 21 floors, the second selects the same floor with probability 1/21, so $P(S_2 = 1) = 1/21$. For $m = 3$, $S_3 = 1$ occurs if the second and third persons entering the elevator both select the same floor as was selected by the first person, so $P(S_3 = 1) = (1/21)^2 = 1/441$. Furthermore, $S_3 = 3$ occurs precisely when all three persons choose a different floor. Since there are $21 \cdot 20 \cdot 19$ ways to do this out of a total of $21^3$ possible ways, we find that $P(S_3 = 3) = 380/441$. Since $S_3$ can only attain the values 1, 2, 3, it follows that $P(S_3 = 2) = 1 - P(S_3 = 1) - P(S_3 = 3) = 60/441$.
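
For three people the distribution of the number of stops in solution 4.10 c can be found by brute force over all $21^3$ equally likely floor choices. The following is a sketch with illustrative names; it assumes, as the solution does, that each person picks one of 21 floors uniformly and independently.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

floors = range(21)
people = 3

# The number of stops is the number of distinct floors chosen.
counts = Counter(len(set(choice)) for choice in product(floors, repeat=people))
total = 21 ** people
pmf = {stops: Fraction(n, total) for stops, n in counts.items()}
print(pmf)
```

The three probabilities match 1/441, 60/441, and 380/441 after reduction to lowest terms.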

4.13 a  Since we wait for the first time we draw the marked bolt in independent draws, each with a Ber(p) distribution, where $p$ is the probability to draw the bolt (so $p = 1/N$), we find, using a reasoning as in Section 4.4, that X has a Geo(1/N) distribution.

4.13 b  Clearly, $P(Y = 1) = 1/N$. Let $D_i$ be the event that the marked bolt was drawn (for the first time) in the $i$th draw. For $k = 2, \ldots, N$ we have that

$$P(Y = k) = P(D_1^c \cap \cdots \cap D_{k-1}^c \cap D_k) = P(D_k \mid D_1^c \cap \cdots \cap D_{k-1}^c) \cdot P(D_1^c \cap \cdots \cap D_{k-1}^c).$$

Now $P(D_k \mid D_1^c \cap \cdots \cap D_{k-1}^c) = \frac{1}{N - k + 1}$,

$$P(D_1^c \cap \cdots \cap D_{k-1}^c) = P(D_{k-1}^c \mid D_1^c \cap \cdots \cap D_{k-2}^c) \cdot P(D_1^c \cap \cdots \cap D_{k-2}^c),$$

and

$$P(D_{k-1}^c \mid D_1^c \cap \cdots \cap D_{k-2}^c) = 1 - P(D_{k-1} \mid D_1^c \cap \cdots \cap D_{k-2}^c) = 1 - \frac{1}{N - k + 2}.$$

Continuing in this way, we find after $k$ steps that

$$P(Y = k) = \frac{1}{N - k + 1} \cdot \frac{N - k + 1}{N - k + 2} \cdot \frac{N - k + 2}{N - k + 3} \cdots \frac{N - 2}{N - 1} \cdot \frac{N - 1}{N} = \frac{1}{N}.$$

See also Section 9.3, where the distribution of Y is derived in a different way.
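
The telescoping product in solution 4.13 b can be checked numerically by multiplying the conditional probabilities draw by draw. This is a sketch under the solution's setup (drawing without replacement from N bolts, exactly one of them marked); the function name is illustrative.

```python
from fractions import Fraction

def prob_first_marked_at(k, n_bolts):
    """P(Y = k): the marked bolt appears for the first time at draw k."""
    p = Fraction(1)
    for j in range(1, k):
        # Before draw j there are n_bolts - j + 1 bolts left; miss the marked one.
        p *= 1 - Fraction(1, n_bolts - j + 1)
    # At draw k there are n_bolts - k + 1 bolts left; hit the marked one.
    return p * Fraction(1, n_bolts - k + 1)

n = 10
print([prob_first_marked_at(k, n) for k in range(1, n + 1)])
```

Every entry equals 1/N, so Y is uniform on 1, ..., N.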

4.13 c  For $k = 0, 1, \ldots, r$, the probability $P(Z = k)$ is equal to the number of ways the event $\{Z = k\}$ can occur, divided by the number of ways $\binom{N}{r}$ we can select $r$ objects from $N$ objects, see also Section 4.3. Since one can select $k$ marked bolts from $m$ marked ones in $\binom{m}{k}$ ways, and $r - k$ nonmarked bolts from $N - m$ nonmarked ones in $\binom{N-m}{r-k}$ ways, it follows that

$$P(Z = k) = \frac{\binom{m}{k}\binom{N-m}{r-k}}{\binom{N}{r}} \quad \text{for } k = 0, 1, 2, \ldots, r.$$
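
The counting formula of solution 4.13 c translates directly into code. The sketch below is illustrative: the function name and the example values N = 20, m = 5, r = 4 are assumptions chosen for the demonstration, not values from the exercise.

```python
from fractions import Fraction
from math import comb

def prob_k_marked(k, n_total, n_marked, n_drawn):
    """P(Z = k): k marked bolts among n_drawn bolts drawn without replacement."""
    ways = comb(n_marked, k) * comb(n_total - n_marked, n_drawn - k)
    return Fraction(ways, comb(n_total, n_drawn))

# Hypothetical example values.
N, m, r = 20, 5, 4
pmf = [prob_k_marked(k, N, m, r) for k in range(r + 1)]
print(pmf, sum(pmf))
```

The probabilities sum to 1, as they must for a probability mass function.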

5.4 a Let $T$ be the time until the next arrival of a bus. Then $T$ has a $U(4,6)$
distribution. Hence $P(X \le 4.5) = P(T \le 4.5) = \int_4^{4.5} \frac{1}{2}\,dx = 1/4$.

5.4 b Since Jensen leaves when the next bus arrives after more than 5 minutes,
$P(X = 5) = P(T > 5) = \int_5^6 \frac{1}{2}\,dx = 1/2$.

5.4  c  Since  P(X  =  5)  =  0.5  >  0,  X  cannot  be  continuous.  Since  X  can  take  any  of 
the  uncountable  values  in  [4,  5],  it  can  also  not  be  discrete. 

5.8 a The probability density $g(y) = 1/(2\sqrt{ry})$ has an asymptote in 0 and decreases
to $1/(2r)$ in the point $r$. Outside $[0, r]$ the function is 0.

5.8 b The second darter is better: for each $0 < b < r$ one has $(b/r)^2 < \sqrt{b/r}$, so the
second darter always has a larger probability to get closer to the center.

5.8  c  Any  function  F  that  is  0  left  from  0,  increasing  on  [0,r],  takes  the  value  0.9 
in  r/10,  and  takes  the  value  1  in  r  and  to  the  right  of  r  is  a  correct  answer  to  this 
question. 
5.13 a This follows with a change of variable transformation $x \mapsto -x$ in the integral:

\[ \Phi(-a) = \int_{-\infty}^{-a} \phi(x)\,dx = \int_a^{\infty} \phi(-x)\,dx = \int_a^{\infty} \phi(x)\,dx = 1 - \Phi(a). \]

5.13 b This is straightforward: $P(Z \le -2) = \Phi(-2) = 1 - \Phi(2) = 0.0228$.

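6.5 We see that

\[ X \le a \;\Leftrightarrow\; -\ln U \le a \;\Leftrightarrow\; \ln U \ge -a \;\Leftrightarrow\; U \ge e^{-a}, \]

and so $P(X \le a) = P(U \ge e^{-a}) = 1 - P(U < e^{-a}) = 1 - e^{-a}$, where we use
$P(U < p) = p$ for $0 \le p \le 1$ applied to $p = e^{-a}$ (remember that $a \ge 0$).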

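6.7 We need to obtain $F^{\mathrm{inv}}$, and do this by solving $F(x) = u$, for $0 \le u \le 1$:

\[ 1 - e^{-5x^2} = u \;\Leftrightarrow\; e^{-5x^2} = 1 - u \;\Leftrightarrow\; -5x^2 = \ln(1 - u) \]
\[ \Leftrightarrow\; x^2 = -0.2\ln(1 - u) \;\Leftrightarrow\; x = \sqrt{-0.2\ln(1 - u)}. \]

The solution is $Z = \sqrt{-0.2 \ln U}$ (replacing $1 - U$ by $U$, see Exercise 6.3). Note that
$Z^2$ has an $Exp(5)$ distribution.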
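The inverse-transform recipe of 6.7 can be tried out directly. The sketch below assumes the distribution function $F(x) = 1 - e^{-5x^2}$ for $x \ge 0$ that is solved for above; the function name `sample_x`, the seed, and the sample size are illustrative choices.

```python
import math
import random


def sample_x(rng):
    """Draw one value with distribution function F(x) = 1 - exp(-5 x^2)
    by plugging a uniform random number into the inverse of F."""
    u = rng.random()  # uniform on [0, 1)
    # 1 - u lies in (0, 1], so the logarithm is always defined.
    return math.sqrt(-0.2 * math.log(1.0 - u))


rng = random.Random(2024)  # fixed seed so the run is reproducible
sample = [sample_x(rng) for _ in range(20000)]

# Compare the empirical distribution function with F at one point.
x0 = 0.3
empirical = sum(x <= x0 for x in sample) / len(sample)
theoretical = 1.0 - math.exp(-5.0 * x0 ** 2)
print(empirical, theoretical)
```

The two printed numbers should be close, but they will not agree exactly because the first one is a simulation result.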

6.10 a Define random variables $B_i = 1$ if $U_i \le p$ and $B_i = 0$ if $U_i > p$. Then
$P(B_i = 1) = p$ and $P(B_i = 0) = 1 - p$: each $B_i$ has a $Ber(p)$ distribution. If
$B_1 = B_2 = \cdots = B_{k-1} = 0$ and $B_k = 1$, then $N = k$, i.e., $N$ is the position in the
sequence of Bernoulli random variables where the first 1 occurs. This is a $Geo(p)$
distribution. This can be verified by computing the probability mass function: for
$k \ge 1$,

\[ P(N = k) = P(B_1 = B_2 = \cdots = B_{k-1} = 0,\, B_k = 1) \]
\[ = P(B_1 = 0)\,P(B_2 = 0) \cdots P(B_{k-1} = 0)\,P(B_k = 1) \]
\[ = (1 - p)^{k-1} p. \]

6.10 b If $Y$ is (a real number!) greater than $n$, then rounding upwards means we
obtain $n + 1$ or higher, so $\{Y > n\} = \{Z \ge n + 1\} = \{Z > n\}$. Therefore,
$P(Z > n) = P(Y > n) = e^{-\lambda n} = (e^{-\lambda})^n$. From $\lambda = -\ln(1 - p)$ we see: $e^{-\lambda} = 1 - p$,
so the last probability is $(1 - p)^n$. From $P(Z > n - 1) = P(Z = n) + P(Z > n)$ we
find: $P(Z = n) = P(Z > n - 1) - P(Z > n) = (1-p)^{n-1} - (1-p)^n = (1-p)^{n-1} p$.
$Z$ has a $Geo(p)$ distribution.
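The construction in 6.10 b (round an exponential random variable upward) can be illustrated with a short simulation. This is only a sketch: the value of $p$, the seed, and the helper name `geometric_from_exponential` are assumptions made for the example.

```python
import math
import random


def geometric_from_exponential(p, rng):
    """Generate a Geo(p) value as the round-up of an Exp(lambda) value,
    with lambda = -ln(1 - p)."""
    lam = -math.log(1.0 - p)
    y = rng.expovariate(lam)
    # max(1, ...) guards against the (practically impossible) draw y == 0.0.
    return max(1, math.ceil(y))


p = 0.3  # illustrative value
rng = random.Random(7)
draws = [geometric_from_exponential(p, rng) for _ in range(50000)]

mean = sum(draws) / len(draws)                     # should be near 1/p
frac_one = sum(z == 1 for z in draws) / len(draws) # should be near p
print(mean, frac_one)
```

For a $Geo(p)$ distribution the expectation is $1/p$ and $P(Z = 1) = p$, which is what the two printed numbers approximate.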

6.12 We need to generate stock prices for the next five years, or 60 months. So we
need sixty $U(0, 1)$ random variables $U_1, \ldots, U_{60}$. Let $S_i$ denote the stock price in
month $i$, and set $S_0 = 100$, the initial stock price. From the $U_i$ we obtain the stock
movement, as follows, for $i = 1, 2, \ldots$:

\[ S_i = \begin{cases} 0.95\,S_{i-1} & \text{if } U_i < 0.25, \\ S_{i-1} & \text{if } 0.25 \le U_i \le 0.75, \\ 1.05\,S_{i-1} & \text{if } U_i > 0.75. \end{cases} \]

We have carried this out, using the realizations below:

   1-10    0.72  0.03  0.01  0.81  0.97  0.31  0.76  0.70  0.71  0.25
  11-20    0.88  0.25  0.89  0.95  0.82  0.52  0.37  0.40  0.82  0.04
  21-30    0.38  0.88  0.81  0.09  0.36  0.93  0.00  0.14  0.74  0.48
  31-40    0.34  0.34  0.37  0.30  0.74  0.03  0.16  0.92  0.25  0.20
  41-50    0.37  0.24  0.09  0.69  0.91  0.04  0.81  0.95  0.29  0.47
  51-60    0.19  0.76  0.98  0.31  0.70  0.36  0.56  0.22  0.78  0.41

We do not list all the stock prices, just the ones that matter for our investment
strategy (you can verify this). We first wait until the price drops below €95, which
happens at $S_4 = 94.76$. Our money has been in the bank for four months, so we own
€1000 · $1.005^4$ = €1020.15, for which we can buy 1020.15/94.76 = 10.77 shares.
Next we wait until the price hits €110; this happens at $S_{16} = 114.61$. We sell
our shares for €10.77 · 114.61 = €1233.85, and put the money in the bank. At
$S_{43} = 92.19$ we buy stock again, for the €1233.85 · $1.005^{27}$ = €1411.71 that has
accrued in the bank. We can buy 15.31 shares. For the rest of the five-year period
nothing happens; the final price is $S_{60} = 100.63$, which puts the value of our portfolio
at €1540.65.

For  a  real  simulation  the  above  should  be  repeated,  say,  one  thousand  times.  The 
one  thousand  net  results  then  give  us  an  impression  of  the  probability  distribution 
that  corresponds  to  this  model  and  strategy. 
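A repeated simulation of this kind could be organized as in the following sketch. It assumes the ingredients used in the solution above (start capital €1000, monthly interest factor 1.005, start price 100, buy below €95, sell at €110 or higher) and one particular order of events within a month (interest first, then the price move, then the trading decision). The function name `portfolio_value` and its parameters are illustrative, not part of the exercise.

```python
import random


def portfolio_value(uniforms, capital=1000.0, price=100.0, rate=0.005,
                    buy_below=95.0, sell_at=110.0):
    """Run the strategy once for a given sequence of U(0,1) realizations
    and return the final value of the portfolio."""
    cash, shares = capital, 0.0
    for u in uniforms:
        cash *= 1.0 + rate            # interest on money in the bank (0 if invested)
        if u < 0.25:
            price *= 0.95             # price goes down 5%
        elif u > 0.75:
            price *= 1.05             # price goes up 5%
        if shares == 0.0 and price < buy_below:
            shares, cash = cash / price, 0.0   # convert all money to stock
        elif shares > 0.0 and price >= sell_at:
            cash, shares = shares * price, 0.0 # sell all stock
    return cash + shares * price


rng = random.Random(12345)
results = [portfolio_value([rng.random() for _ in range(60)])
           for _ in range(1000)]
average = sum(results) / len(results)
print(min(results), average, max(results))
```

The list `results` plays the role of the "one thousand net results"; a histogram of it would give an impression of the distribution of the final portfolio value under this model.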

7.6  Since  /  is  increasing  on  the  interval  [2,3]  we  know  from  the  interpretation  of 
expectation  as  center  of  gravity  that  the  expectation  should  lie  closer  to  3  than  to  2. 
The computation: $E[Z] = \int_2^3 \frac{3}{19} z^3\,dz = 2\frac{43}{76}$.

7.15 a We use the change-of-units rule for the expectation twice:

\[ \mathrm{Var}(rX) = E[(rX - E[rX])^2] = E[(rX - rE[X])^2] \]
\[ = E[r^2 (X - E[X])^2] = r^2 E[(X - E[X])^2] = r^2\,\mathrm{Var}(X). \]

7.15 b Now we use the change-of-units rule for the expectation once:

\[ \mathrm{Var}(X + s) = E[((X + s) - E[X + s])^2] \]
\[ = E[((X + s) - E[X] - s)^2] = E[(X - E[X])^2] = \mathrm{Var}(X). \]

7.15 c With first b, and then a: $\mathrm{Var}(rX + s) = \mathrm{Var}(rX) = r^2\,\mathrm{Var}(X)$.

7.17 a Since $a_i \ge 0$ and $p_i \ge 0$ it must follow that $a_1 p_1 + \cdots + a_r p_r \ge 0$. So
$0 = E[U] = a_1 p_1 + \cdots + a_r p_r \ge 0$. As we may assume that all $p_i > 0$, it follows that

\[ a_1 = a_2 = \cdots = a_r = 0. \]

7.17 b Let $m = E[V] = p_1 b_1 + \cdots + p_r b_r$. Then the random variable $U = (V - E[V])^2$
takes the values $a_1 = (b_1 - m)^2, \ldots, a_r = (b_r - m)^2$. Since $E[U] = \mathrm{Var}(V) = 0$, part
a tells us that $0 = a_1 = (b_1 - m)^2, \ldots, 0 = a_r = (b_r - m)^2$. But this is only
possible if $b_1 = m, \ldots, b_r = m$. Since $m = E[V]$, this is the same as saying that
$P(V = E[V]) = 1$.

8.2 a First we determine the possible values that $Y$ can take. Here these are $-1$, 0,
and 1. Then we investigate which $x$-values lead to these $y$-values and sum the
probabilities of the $x$-values to obtain the probability of the $y$-value. For instance,

\[ P(Y = 0) = P(X = 2) + P(X = 4) + P(X = 6) = \tfrac16 + \tfrac16 + \tfrac16 = \tfrac12. \]

Similarly, we obtain for the two other values

\[ P(Y = -1) = P(X = 3) = \tfrac16, \qquad P(Y = 1) = P(X = 1) + P(X = 5) = \tfrac13. \]

8.2 b The values taken by $Z$ are $-1$, 0, and 1. Furthermore

\[ P(Z = 0) = P(X = 1) + P(X = 3) + P(X = 5) = \tfrac16 + \tfrac16 + \tfrac16 = \tfrac12, \]

and similarly $P(Z = -1) = 1/3$ and $P(Z = 1) = 1/6$.


D  Full  solutions  to  selected  exercises 




8.2 c Since for any $\alpha$ one has $\sin^2(\alpha) + \cos^2(\alpha) = 1$, $W$ can only take the value 1,
so $P(W = 1) = 1$.

8.10 Because of symmetry: $P(X \ge 3) = 0.500$. Furthermore: $\sigma^2 = 4$, so $\sigma = 2$.
Then $Z = (X - 3)/2$ is an $N(0, 1)$ distributed random variable, so that
$P(X \le 1) = P((X - 3)/2 \le (1 - 3)/2) = P(Z \le -1) = P(Z \ge 1) = 0.1587$.

8.11 Since $-g$ is a convex function, Jensen's inequality yields that $-g(E[X]) \le
E[-g(X)]$. Since $E[-g(X)] = -E[g(X)]$, the inequality follows by multiplying both
sides by $-1$.

8.12 a The possible values $Y$ can take are $\sqrt{0} = 0$, $\sqrt{1} = 1$, $\sqrt{100} = 10$, and
$\sqrt{10\,000} = 100$. Hence the probability mass function is given by

  y           0     1     10    100
  P(Y = y)    1/4   1/4   1/4   1/4

8.12 b Compute the second derivative: $\frac{d^2}{dx^2}\sqrt{x} = -\frac14 x^{-3/2} < 0$. Hence $g(x) = -\sqrt{x}$
is a convex function. Jensen's inequality yields that $\sqrt{E[X]} \ge E[\sqrt{X}]$.

8.12 c We obtain $\sqrt{E[X]} = \sqrt{(0 + 1 + 100 + 10\,000)/4} = 50.25$, but
$E[\sqrt{X}] = E[Y] = (0 + 1 + 10 + 100)/4 = 27.75$.
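The two numbers in 8.12 c are quick to recompute. The sketch below assumes, as in the solution, that $X$ takes the values 0, 1, 100, and 10 000 with probability 1/4 each; the variable names are illustrative.

```python
from math import sqrt

values = [0, 1, 100, 10000]          # possible values of X, each with probability 1/4
prob = 1 / 4

e_x = sum(v * prob for v in values)            # E[X]
sqrt_of_mean = sqrt(e_x)                       # sqrt(E[X])
mean_of_sqrt = sum(sqrt(v) * prob for v in values)  # E[sqrt(X)]

# Jensen's inequality for the concave square root: sqrt(E[X]) >= E[sqrt(X)].
print(sqrt_of_mean, mean_of_sqrt)
```

The gap between the two values is large here because the distribution of $X$ is very spread out.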


8.19 a This happens for all $\varphi$ in the interval $[\pi/4, \pi/2]$, which corresponds to the
upper right quarter of the circle.

8.19 b Since $\{Z \le t\} = \{X \le \arctan(t)\}$, we obtain

\[ F_Z(t) = P(Z \le t) = P(X \le \arctan(t)) = \frac12 + \frac{1}{\pi}\arctan(t). \]

8.19 c Differentiating $F_Z$ we obtain that the probability density function of $Z$ is

\[ f_Z(z) = \frac{d}{dz} F_Z(z) = \frac{d}{dz}\left(\frac12 + \frac{1}{\pi}\arctan(z)\right) = \frac{1}{\pi(1 + z^2)} \quad \text{for } -\infty < z < \infty. \]


9.2 a From $P(X = 1, Y = 1) = 1/2$, $P(X = 1) = 2/3$, and the fact that $P(X = 1) =
P(X = 1, Y = 1) + P(X = 1, Y = -1)$, it follows that $P(X = 1, Y = -1) = 1/6$.
Since $P(Y = 1) = 1/2$ and $P(X = 1, Y = 1) = 1/2$, we must have: $P(X = 0, Y = 1)$
and $P(X = 2, Y = 1)$ are both zero. From this and the fact that $P(X = 0) = 1/6 =
P(X = 2)$ one finds that $P(X = 0, Y = -1) = 1/6 = P(X = 2, Y = -1)$.

9.2 b Since, e.g., $P(X = 2, Y = 1) = 0$ is different from $P(X = 2)\,P(Y = 1) = \frac16 \cdot \frac12$,
one finds that $X$ and $Y$ are dependent.

9.8 a Since $X$ can attain the values 0 and 1 and $Y$ the values 0 and 2, $Z$ can attain
the values 0, 1, 2, and 3 with probabilities: $P(Z = 0) = P(X = 0, Y = 0) = 1/4$,
$P(Z = 1) = P(X = 1, Y = 0) = 1/4$, $P(Z = 2) = P(X = 0, Y = 2) = 1/4$, and
$P(Z = 3) = P(X = 1, Y = 2) = 1/4$.

9.8 b Since $\tilde{X} = \tilde{Z} - \tilde{Y}$, $\tilde{X}$ can attain the values $-2$, $-1$, 0, 1, 2, and 3 with
probabilities

\[ P(\tilde{X} = -2) = P(\tilde{Z} = 0, \tilde{Y} = 2) = 1/8, \]
\[ P(\tilde{X} = -1) = P(\tilde{Z} = 1, \tilde{Y} = 2) = 1/8, \]
\[ P(\tilde{X} = 0) = P(\tilde{Z} = 0, \tilde{Y} = 0) + P(\tilde{Z} = 2, \tilde{Y} = 2) = 1/4, \]
\[ P(\tilde{X} = 1) = P(\tilde{Z} = 1, \tilde{Y} = 0) + P(\tilde{Z} = 3, \tilde{Y} = 2) = 1/4, \]
\[ P(\tilde{X} = 2) = P(\tilde{Z} = 2, \tilde{Y} = 0) = 1/8, \]
\[ P(\tilde{X} = 3) = P(\tilde{Z} = 3, \tilde{Y} = 0) = 1/8. \]

We have the following table:

  z                  -2    -1    0     1     2     3
  p_{\tilde X}(z)    1/8   1/8   1/4   1/4   1/8   1/8

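9.9 a One has that $F_X(x) = \lim_{y \to \infty} F(x, y)$. So for $x \le 0$: $F_X(x) = 0$, and for
$x > 0$: $F_X(x) = F(x, \infty) = 1 - e^{-2x}$. Similarly, $F_Y(y) = 0$ for $y \le 0$, and for $y > 0$:
$F_Y(y) = F(\infty, y) = 1 - e^{-y}$.

9.9 b For $x > 0$ and $y > 0$:

\[ f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y) = \frac{\partial}{\partial x}\left(e^{-y} - e^{-(2x+y)}\right) = 2e^{-(2x+y)}. \]

9.9 c There are two ways to determine $f_X(x)$:

\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_0^{\infty} 2e^{-(2x+y)}\,dy = 2e^{-2x} \quad \text{for } x > 0 \]

and

\[ f_X(x) = \frac{d}{dx} F_X(x) = 2e^{-2x} \quad \text{for } x > 0. \]

Using either way one finds that $f_Y(y) = e^{-y}$ for $y > 0$.

9.9 d Since $F(x, y) = F_X(x) F_Y(y)$ for all $x, y$, we find that $X$ and $Y$ are independent.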

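9.11 To determine $P(X < Y)$ we must integrate $f(x, y)$ over the region $G$ of points
$(x, y)$ in $\mathbb{R}^2$ for which $x$ is smaller than $y$:

\[ P(X < Y) = \iint_{\{(x,y) \in \mathbb{R}^2 :\, x < y\}} f(x, y)\,dx\,dy
 = \int_0^1 \left( \int_0^y f(x, y)\,dx \right) dy
 = \int_0^1 \left( \int_0^y \frac{12}{5}\,xy(1 + y)\,dx \right) dy \]
\[ = \frac{12}{5} \int_0^1 y(1 + y) \left( \int_0^y x\,dx \right) dy
 = \frac{12}{10} \int_0^1 y^3 (1 + y)\,dy = \frac{27}{50}. \]

Here we used that $f(x, y) = 0$ for $(x, y)$ outside the unit square.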

9.15 a Setting $\square(a, b)$ as the set of points $(x, y)$ for which $x \le a$ and $y \le b$, we
have that

\[ F(a, b) = \frac{\text{area}(\Delta \cap \square(a, b))}{\text{area of } \Delta}. \]

• If $a < 0$ or if $b < 0$ (or both), then area$(\Delta \cap \square(a, b)) = 0$, so $F(a, b) = 0$,

• If $(a, b) \in \Delta$, then area$(\Delta \cap \square(a, b)) = a(b - \frac12 a)$, so $F(a, b) = a(2b - a)$,

• If $0 \le b \le 1$, and $a > b$, then area$(\Delta \cap \square(a, b)) = \frac12 b^2$, so $F(a, b) = b^2$,

• If $0 \le a \le 1$, and $b > 1$, then area$(\Delta \cap \square(a, b)) = a - \frac12 a^2$, so $F(a, b) = 2a - a^2$,

• If both $a > 1$ and $b > 1$, then area$(\Delta \cap \square(a, b)) = \frac12$, so $F(a, b) = 1$.

9.15 b Since $f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y)$, we find for $(x, y) \in \Delta$ that $f(x, y) = 2$.
Furthermore, $f(x, y) = 0$ for $(x, y)$ outside the triangle $\Delta$.

9.15 c For $x$ between 0 and 1,

\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_x^1 2\,dy = 2(1 - x). \]

For $y$ between 0 and 1,

\[ f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \int_0^y 2\,dx = 2y. \]

10.6 a When $c = 0$, the joint distribution becomes

                         a
  b            -1      0       1       P(Y = b)
  -1           2/45    9/45    4/45    1/3
   0           7/45    5/45    3/45    1/3
   1           6/45    1/45    8/45    1/3
  P(X = a)     1/3     1/3     1/3     1

We find $E[X] = (-1) \cdot \frac13 + 0 \cdot \frac13 + 1 \cdot \frac13 = 0$, and similarly $E[Y] = 0$. By leaving out
terms where either $X = 0$ or $Y = 0$, we find

\[ E[XY] = (-1)\cdot(-1)\cdot\tfrac{2}{45} + (-1)\cdot 1\cdot\tfrac{6}{45} + 1\cdot(-1)\cdot\tfrac{4}{45} + 1\cdot 1\cdot\tfrac{8}{45} = 0, \]

which implies that $\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y] = 0$.

10.6 b Note that the variables $X$ and $Y$ in part b are equal to the ones from part a,
shifted by $c$. If we write $U$ and $V$ for the variables from a, then $X = U + c$ and
$Y = V + c$. According to the rule on the covariance under change of units, we then
immediately find $\mathrm{Cov}(X, Y) = \mathrm{Cov}(U + c, V + c) = \mathrm{Cov}(U, V) = 0$.

Alternatively, one could also compute the covariance from $\mathrm{Cov}(X, Y) = E[XY] -
E[X]\,E[Y]$. We find $E[X] = (c - 1)\cdot\frac13 + c\cdot\frac13 + (c + 1)\cdot\frac13 = c$, and similarly $E[Y] = c$.
Since

\[ E[XY] = (c-1)(c-1)\tfrac{2}{45} + (c-1)\,c\,\tfrac{9}{45} + (c-1)(c+1)\tfrac{4}{45} \]
\[ \qquad + c\,(c-1)\tfrac{7}{45} + c \cdot c\,\tfrac{5}{45} + c\,(c+1)\tfrac{3}{45} \]
\[ \qquad + (c+1)(c-1)\tfrac{6}{45} + (c+1)\,c\,\tfrac{1}{45} + (c+1)(c+1)\tfrac{8}{45} = c^2, \]

we find $\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y] = c^2 - c \cdot c = 0$.

10.6 c No, $X$ and $Y$ are not independent. For instance, $P(X = c, Y = c + 1) = 1/45$,
which differs from $P(X = c)\,P(Y = c + 1) = 1/9$.
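The covariance and the dependence claimed in 10.6 can be verified from the joint table for $c = 0$. The sketch below stores the table as a dictionary keyed by $(a, b)$; the layout and variable names are illustrative choices.

```python
from fractions import Fraction as F

# Joint probabilities P(X = a, Y = b) for c = 0, as numerators over 45.
numerators = {
    (-1, -1): 2, (0, -1): 9, (1, -1): 4,
    (-1, 0): 7,  (0, 0): 5,  (1, 0): 3,
    (-1, 1): 6,  (0, 1): 1,  (1, 1): 8,
}
joint = {ab: F(n, 45) for ab, n in numerators.items()}

# Marginal distributions of X and Y.
px = {a: sum(p for (x, _), p in joint.items() if x == a) for a in (-1, 0, 1)}
py = {b: sum(p for (_, y), p in joint.items() if y == b) for b in (-1, 0, 1)}

e_x = sum(a * p for a, p in px.items())
e_y = sum(b * p for b, p in py.items())
e_xy = sum(a * b * p for (a, b), p in joint.items())
cov = e_xy - e_x * e_y

# Independence would require P(X=a, Y=b) = P(X=a) P(Y=b) for every cell.
independent = all(joint[(a, b)] == px[a] * py[b] for (a, b) in joint)
print(cov, independent)
```

This shows zero covariance together with dependence, which is the point of the exercise.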

10.9 a If the aggregated blood sample tests negative, we do not have to perform
additional tests, so that $X_i$ takes on the value 1. If the aggregated blood sample
tests positive, we have to perform 40 additional tests for the blood sample of each
person in the group, so that $X_i$ takes on the value 41. We first find that $P(X_i = 1) =
P(\text{no infections in group of 40}) = (1 - 0.001)^{40} = 0.96$, and therefore $P(X_i = 41) =
1 - P(X_i = 1) = 0.04$.

10.9 b First compute $E[X_i] = 1 \cdot 0.96 + 41 \cdot 0.04 = 2.6$. The expected total number of
tests is $E[X_1 + X_2 + \cdots + X_{25}] = E[X_1] + E[X_2] + \cdots + E[X_{25}] = 25 \cdot 2.6 = 65$. With
the original procedure of blood testing, the total number of tests is $25 \cdot 40 = 1000$. On
average the alternative procedure would only require 65 tests. Only with very small
probability would one end up doing more than 1000 tests, so the alternative
procedure is better.
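The expected number of tests in 10.9 can be recomputed without the intermediate rounding to 0.96 and 0.04. The sketch below uses the numbers from the solution (25 groups of 40 persons, infection probability 0.001); the variable names are illustrative. Because it does not round, it gives a slightly smaller value than 65.

```python
group_size = 40
n_groups = 25
p_infected = 0.001

# Probability that a pooled sample of one group tests negative.
p_negative = (1 - p_infected) ** group_size

# One test if the pool is negative, 1 + 40 tests if it is positive.
expected_per_group = 1 * p_negative + (1 + group_size) * (1 - p_negative)
expected_total = n_groups * expected_per_group

print(round(p_negative, 4), round(expected_total, 2))
```

Either way, the expected number of tests is far below the 1000 tests of the original procedure.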

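10.10 a We find

\[ E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^3 \frac{2}{225}(9x^3 + 7x^2)\,dx = \frac{2}{225}\left[\frac94 x^4 + \frac73 x^3\right]_0^3 = \frac{109}{50}, \]
\[ E[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy = \int_1^2 \frac{1}{25}(3y^3 + 12y^2)\,dy = \frac{1}{25}\left[\frac34 y^4 + 4y^3\right]_1^2 = \frac{157}{100}, \]

so that $E[X + Y] = E[X] + E[Y] = 15/4$.

10.10 b We find

\[ E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx = \int_0^3 \frac{2}{225}(9x^4 + 7x^3)\,dx = \frac{2}{225}\left[\frac95 x^5 + \frac74 x^4\right]_0^3 = \frac{1287}{250}, \]
\[ E[Y^2] = \int_{-\infty}^{\infty} y^2 f_Y(y)\,dy = \int_1^2 \frac{1}{25}(3y^4 + 12y^3)\,dy = \frac{1}{25}\left[\frac35 y^5 + 3y^4\right]_1^2 = \frac{318}{125}, \]
\[ E[XY] = \int_0^3 \int_1^2 xy f(x, y)\,dy\,dx = \int_0^3 \int_1^2 \frac{2}{75}(2x^3 y^2 + x^2 y^3)\,dy\,dx \]
\[ \qquad = \frac{4}{75}\cdot\frac73 \int_0^3 x^3\,dx + \frac{2}{75}\cdot\frac{15}{4} \int_0^3 x^2\,dx = \frac{171}{50}, \]

so that $E[(X + Y)^2] = E[X^2] + E[Y^2] + 2E[XY] = 3633/250$.

10.10 c We find

\[ \mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{1287}{250} - \left(\frac{109}{50}\right)^2 = \frac{989}{2500}, \]
\[ \mathrm{Var}(Y) = E[Y^2] - (E[Y])^2 = \frac{318}{125} - \left(\frac{157}{100}\right)^2 = \frac{791}{10\,000}, \]
\[ \mathrm{Var}(X + Y) = E[(X + Y)^2] - (E[X + Y])^2 = \frac{3633}{250} - \left(\frac{15}{4}\right)^2 = \frac{939}{2000}. \]

Hence, $\mathrm{Var}(X) + \mathrm{Var}(Y) = 0.4747$, which differs from $\mathrm{Var}(X + Y) = 0.4695$.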
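The fractions in 10.10 can be checked with exact arithmetic. The sketch below assumes the marginal densities $f_X(x) = \frac{2}{225}(9x^2 + 7x)$ on $[0, 3]$ and $f_Y(y) = \frac{1}{25}(3y^2 + 12y)$ on $[1, 2]$ and the integrand $\frac{2}{75}(2x^3y^2 + x^2y^3)$ for $E[XY]$, as they appear in the computation above. Polynomials are represented as dictionaries mapping exponents to coefficients; that representation and the helper names are illustrative choices.

```python
from fractions import Fraction as F


def integrate(poly, a, b):
    """Integrate sum(c * t**k) over [a, b]; poly maps exponent k -> coefficient c."""
    return sum(c * (F(b) ** (k + 1) - F(a) ** (k + 1)) / (k + 1)
               for k, c in poly.items())


def moment(density, power, a, b):
    """E[T**power] for a polynomial density on [a, b]."""
    shifted = {k + power: c for k, c in density.items()}
    return integrate(shifted, a, b)


fx = {2: F(2, 225) * 9, 1: F(2, 225) * 7}   # density of X on [0, 3]
fy = {2: F(3, 25), 1: F(12, 25)}            # density of Y on [1, 2]

ex, ex2 = moment(fx, 1, 0, 3), moment(fx, 2, 0, 3)
ey, ey2 = moment(fy, 1, 1, 2), moment(fy, 2, 1, 2)

# E[XY]: each term of the integrand is a product of a power of x and a power of y.
exy = F(2, 75) * (2 * integrate({3: F(1)}, 0, 3) * integrate({2: F(1)}, 1, 2)
                  + integrate({2: F(1)}, 0, 3) * integrate({3: F(1)}, 1, 2))

var_x = ex2 - ex ** 2
var_y = ey2 - ey ** 2
var_sum = ex2 + ey2 + 2 * exy - (ex + ey) ** 2
print(var_x, var_y, var_sum)
```

The difference between `var_sum` and `var_x + var_y` equals twice the covariance, which is small but not zero here.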
10.14 a By using the alternative expression for the covariance and linearity of
expectations, we find

\[ \mathrm{Cov}(X + s, Y + u) \]
\[ = E[(X + s)(Y + u)] - E[X + s]\,E[Y + u] \]
\[ = E[XY + sY + uX + su] - (E[X] + s)(E[Y] + u) \]
\[ = (E[XY] + sE[Y] + uE[X] + su) - (E[X]\,E[Y] + sE[Y] + uE[X] + su) \]
\[ = E[XY] - E[X]\,E[Y] \]
\[ = \mathrm{Cov}(X, Y). \]

10.14 b By using the alternative expression for the covariance and the rule on
expectations under change of units, we find

\[ \mathrm{Cov}(rX, tY) = E[(rX)(tY)] - E[rX]\,E[tY] \]
\[ = E[rtXY] - (rE[X])(tE[Y]) \]
\[ = rtE[XY] - rtE[X]\,E[Y] \]
\[ = rt\,(E[XY] - E[X]\,E[Y]) \]
\[ = rt\,\mathrm{Cov}(X, Y). \]

10.14 c First applying part a and then part b yields

\[ \mathrm{Cov}(rX + s, tY + u) = \mathrm{Cov}(rX, tY) = rt\,\mathrm{Cov}(X, Y). \]


10.18 First note that $X_1 + X_2 + \cdots + X_N$ is the sum of all numbers, which is
a nonrandom constant. Therefore, $\mathrm{Var}(X_1 + X_2 + \cdots + X_N) = 0$. In Section 9.3
we argued that, although we draw without replacement, each $X_i$ has the same
distribution. By the same reasoning, we find that each pair $(X_i, X_j)$, with $i \ne j$,
has the same joint distribution, so that $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_1, X_2)$ for all pairs
with $i \ne j$. Direct application of Exercise 10.17 with $\sigma^2 = (N - 1)(N + 1)/12$ and
$\gamma = \mathrm{Cov}(X_1, X_2)$ gives

\[ 0 = \mathrm{Var}(X_1 + X_2 + \cdots + X_N) = N \cdot \frac{(N - 1)(N + 1)}{12} + N(N - 1)\,\mathrm{Cov}(X_1, X_2). \]

Solving this identity gives $\mathrm{Cov}(X_1, X_2) = -(N + 1)/12$.
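For small $N$ the covariance in 10.18 can be confirmed by brute force. The sketch below assumes that the numbers $1, \ldots, N$ are drawn without replacement, so that every ordered pair of distinct numbers is equally likely for $(X_1, X_2)$; the function name is an illustrative choice.

```python
from fractions import Fraction as F
from itertools import permutations


def cov_first_two(N):
    """Exact Cov(X1, X2) when X1, X2 are the first two of the numbers
    1..N drawn without replacement."""
    pairs = list(permutations(range(1, N + 1), 2))  # all ordered pairs i != j
    n = len(pairs)
    e1 = F(sum(i for i, _ in pairs), n)
    e2 = F(sum(j for _, j in pairs), n)
    e12 = F(sum(i * j for i, j in pairs), n)
    return e12 - e1 * e2


for N in range(2, 9):
    print(N, cov_first_two(N), F(-(N + 1), 12))
```

Each line prints the enumerated covariance next to $-(N + 1)/12$.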

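11.2 a By using the rule on addition of two independent discrete random variables,
we have

\[ P(X + Y = k) = p_Z(k) = \sum_{\ell=0}^{\infty} p_X(k - \ell)\,p_Y(\ell). \]

Because $p_X(a) = 0$ for $a \le -1$, all terms with $\ell \ge k + 1$ vanish, so that

\[ P(X + Y = k) = \sum_{\ell=0}^{k} \frac{1^{k-\ell}}{(k - \ell)!}\,e^{-1} \cdot \frac{1^{\ell}}{\ell!}\,e^{-1} = \frac{e^{-2}}{k!} \sum_{\ell=0}^{k} \binom{k}{\ell} = \frac{2^k}{k!}\,e^{-2}, \]

also using $\sum_{\ell=0}^{k} \binom{k}{\ell} = 2^k$ in the last equality.

11.2 b Similar to part a, by using the rule on addition of two independent discrete
random variables and leaving out terms for which $p_X(a) = 0$, we have

\[ P(X + Y = k) = \sum_{\ell=0}^{k} \frac{\lambda^{k-\ell}}{(k - \ell)!}\,e^{-\lambda} \cdot \frac{\mu^{\ell}}{\ell!}\,e^{-\mu} = \frac{(\lambda + \mu)^k}{k!}\,e^{-(\lambda+\mu)} \sum_{\ell=0}^{k} \binom{k}{\ell} \frac{\lambda^{k-\ell}\mu^{\ell}}{(\lambda + \mu)^k}. \]

Next, write

\[ \frac{\lambda^{k-\ell}\mu^{\ell}}{(\lambda + \mu)^k} = \left(\frac{\mu}{\lambda + \mu}\right)^{\ell} \left(\frac{\lambda}{\lambda + \mu}\right)^{k-\ell} = \left(\frac{\mu}{\lambda + \mu}\right)^{\ell} \left(1 - \frac{\mu}{\lambda + \mu}\right)^{k-\ell} = p^{\ell}(1 - p)^{k-\ell} \]

with $p = \mu/(\lambda + \mu)$. This means that

\[ P(X + Y = k) = \frac{(\lambda + \mu)^k}{k!}\,e^{-(\lambda+\mu)} \sum_{\ell=0}^{k} \binom{k}{\ell} p^{\ell}(1 - p)^{k-\ell} = \frac{(\lambda + \mu)^k}{k!}\,e^{-(\lambda+\mu)}, \]

using that $\sum_{\ell=0}^{k} \binom{k}{\ell} p^{\ell}(1 - p)^{k-\ell} = 1$.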
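The result of 11.2 b, that the sum of independent $Pois(\lambda)$ and $Pois(\mu)$ random variables has a $Pois(\lambda + \mu)$ distribution, can be checked numerically through the convolution sum. The parameter values and function names below are illustrative assumptions.

```python
from math import exp, factorial


def pois_pmf(k, lam):
    """Probability mass function of the Pois(lam) distribution at k."""
    return lam ** k * exp(-lam) / factorial(k)


def sum_pmf(k, lam, mu):
    """P(X + Y = k) via the addition rule for independent discrete
    random variables, with X ~ Pois(lam) and Y ~ Pois(mu)."""
    return sum(pois_pmf(k - l, lam) * pois_pmf(l, mu) for l in range(k + 1))


lam, mu = 1.5, 2.5  # illustrative parameters
max_diff = max(abs(sum_pmf(k, lam, mu) - pois_pmf(k, lam + mu))
               for k in range(20))
print(max_diff)
```

The printed maximum difference should be at the level of floating-point rounding.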

11.4 a From the fact that $X$ has an $N(2, 5)$ distribution, it follows that $E[X] =
2$ and $\mathrm{Var}(X) = 5$. Similarly, $E[Y] = 5$ and $\mathrm{Var}(Y) = 9$. Hence by linearity of
expectations,

\[ E[Z] = E[3X - 2Y + 1] = 3E[X] - 2E[Y] + 1 = 3 \cdot 2 - 2 \cdot 5 + 1 = -3. \]

By the rules for the variance and covariance,

\[ \mathrm{Var}(Z) = 9\,\mathrm{Var}(X) + 4\,\mathrm{Var}(Y) - 12\,\mathrm{Cov}(X, Y) = 9 \cdot 5 + 4 \cdot 9 - 12 \cdot 0 = 81, \]

using that $\mathrm{Cov}(X, Y) = 0$, due to independence of $X$ and $Y$.

11.4 b The random variables $3X$ and $-2Y + 1$ are independent and, according to
the rule for the normal distribution under a change of units (page 106), it follows
that they both have a normal distribution. Next, the sum rule for independent
normal random variables then yields that $Z = (3X) + (-2Y + 1)$ also has a normal
distribution. Its parameters are the expectation and variance of $Z$. From a it follows
that $Z$ has an $N(-3, 81)$ distribution.

11.4 c From b we know that $Z$ has an $N(-3, 81)$ distribution, so that $(Z + 3)/9$
has a standard normal distribution. Therefore

\[ P(Z \le 6) = P\left(\frac{Z + 3}{9} \le \frac{6 + 3}{9}\right) = \Phi(1), \]

where $\Phi$ is the standard normal distribution function. From Table B.1 we find that
$\Phi(1) = 1 - 0.1587 = 0.8413$.

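11.9 a According to the product rule on page 160,

\[ f_Z(z) = \int_1^z f_Y\!\left(\frac{z}{x}\right) f_X(x)\,\frac{1}{x}\,dx = \int_1^z \frac{1}{(z/x)^2} \cdot \frac{3}{x^4} \cdot \frac{1}{x}\,dx = \frac{3}{z^2} \int_1^z x^{-3}\,dx \]
\[ = \frac{3}{z^2}\left[-\frac12 x^{-2}\right]_1^z = \frac32 \cdot \frac{1}{z^2}\left(1 - \frac{1}{z^2}\right). \]

11.9 b According to the product rule,

\[ f_Z(z) = \int_1^z f_Y\!\left(\frac{z}{x}\right) f_X(x)\,\frac{1}{x}\,dx = \int_1^z \frac{\beta}{(z/x)^{\beta+1}} \cdot \frac{\alpha}{x^{\alpha+1}} \cdot \frac{1}{x}\,dx = \frac{\alpha\beta}{z^{\beta+1}} \int_1^z x^{\beta-\alpha-1}\,dx \]
\[ = \frac{\alpha\beta}{z^{\beta+1}}\left[\frac{x^{\beta-\alpha}}{\beta - \alpha}\right]_1^z = \frac{\alpha\beta}{\beta - \alpha}\left(\frac{1}{z^{\alpha+1}} - \frac{1}{z^{\beta+1}}\right). \]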


12.1  e  This  is  certainly  open  to  discussion.  Bankruptcies:  no  (they  come  in  clusters, 
don’t  they?).  Eggs:  no  (I  suppose  after  one  egg  it  takes  the  chicken  some  time  to 
produce  another).  Examples  3  and  4  are  the  best  candidates.  Example  5  could  be 
modeled  by  the  Poisson  process  if  the  crossing  is  not  a  dangerous  one;  otherwise 
authorities  might  take  measures  and  destroy  the  homogeneity. 

12.6 The expected number of flaws in 1 meter is 100/40 = 2.5, and hence
the number of flaws $X$ has a $Pois(2.5)$ distribution. The answer is $P(X = 2) =
\frac{1}{2!}(2.5)^2 e^{-2.5} = 0.256$.

12.7 a It is reasonable to estimate $\lambda$ with (nr. of cars)/(total time in sec.) = 0.192.

12.7 b 19/120 = 0.1583, and if $\lambda = 0.192$ then $P(N(10) = 0) = e^{-0.192 \cdot 10} = 0.147$.

12.7 c $P(N(10) = 10)$ with $\lambda$ from a seems a reasonable approximation of this
probability. It equals $e^{-1.92} \cdot (0.192 \cdot 10)^{10}/10! = 2.71 \cdot 10^{-5}$.

12.11 Following the hint, we obtain:

\[ P(N([0, s]) = k, N([0, 2s]) = n) = P(N([0, s]) = k, N((s, 2s]) = n - k) \]
\[ = P(N([0, s]) = k) \cdot P(N((s, 2s]) = n - k) \]
\[ = (\lambda s)^k e^{-\lambda s}/(k!) \cdot (\lambda s)^{n-k} e^{-\lambda s}/((n - k)!) \]
\[ = (\lambda s)^n e^{-2\lambda s}/(k!\,(n - k)!). \]

So

\[ P(N([0, s]) = k \mid N([0, 2s]) = n) = \frac{P(N([0, s]) = k, N([0, 2s]) = n)}{P(N([0, 2s]) = n)} \]
\[ = n!/(k!\,(n - k)!) \cdot (\lambda s)^n/(2\lambda s)^n \]
\[ = n!/(k!\,(n - k)!) \cdot (1/2)^n. \]

This holds for $k = 0, \ldots, n$, so we find the $Bin(n, \frac12)$ distribution.

13.2 a From the formulas for the $U(a, b)$ distribution, substituting $a = -1/2$ and
$b = 1/2$, we derive that $E[X_i] = 0$ and $\mathrm{Var}(X_i) = 1/12$.

13.2 b We write $S = X_1 + X_2 + \cdots + X_{100}$, for which we find $E[S] = E[X_1] + \cdots +
E[X_{100}] = 0$ and, by independence, $\mathrm{Var}(S) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_{100}) = 100 \cdot \frac{1}{12} =
100/12$. We find from Chebyshev's inequality:

\[ P(|S| > 10) = P(|S - 0| > 10) \le \frac{\mathrm{Var}(S)}{10^2} = \frac{1}{12}. \]
13.4 a Because $X_i$ has a $Ber(p)$ distribution, $E[X_i] = p$ and $\mathrm{Var}(X_i) = p(1 - p)$,
and so $E[\bar{X}_n] = p$ and $\mathrm{Var}(\bar{X}_n) = \mathrm{Var}(X_i)/n = p(1 - p)/n$. By Chebyshev's
inequality:

\[ P(|\bar{X}_n - p| \ge 0.2) \le \frac{p(1 - p)/n}{(0.2)^2} = \frac{25\,p(1 - p)}{n}. \]

The right-hand side should be at most 0.1 (note that we switched to the complement).
If $p = 1/2$ we therefore require $25/(4n) \le 0.1$, or $n \ge 25/(4 \cdot 0.1) = 62.5$,
i.e., $n \ge 63$. Now, suppose $p \ne 1/2$; using $n = 63$ and $p(1 - p) \le 1/4$ we conclude
that $25p(1 - p)/n \le 25 \cdot (1/4)/63 = 0.0992 < 0.1$, so (because of the inequality) the
computed value suffices for other values of $p$ as well.

13.4 b For arbitrary $a > 0$ we conclude from Chebyshev's inequality:

\[ P(|\bar{X}_n - p| \ge a) \le \frac{p(1 - p)/n}{a^2} = \frac{p(1 - p)}{na^2} \le \frac{1}{4na^2}, \]

where we used $p(1 - p) \le 1/4$ again. The question now becomes: when $a = 0.1$, for
what $n$ is $1/(4na^2) \le 0.1$? We find: $n \ge 1/(4 \cdot 0.1 \cdot (0.1)^2) = 250$, so $n = 250$ is large
enough.

13.4 c From part a we know that an error of size 0.2 or more occurs with a probability
of at most $25/(4n)$, regardless of the value of $p$. So, we need $25/(4n) \le 0.05$, i.e.,
$n \ge 25/(4 \cdot 0.05) = 125$.

13.4 d We compute $P(\bar{X}_n \le 0.5)$ for the case that $p = 0.6$. Then $E[\bar{X}_n] = 0.6$
and $\mathrm{Var}(\bar{X}_n) = 0.6 \cdot 0.4/n$. Chebyshev's inequality cannot be used directly; we need
an intermediate step: the event $\{\bar{X}_n \le 0.5\}$ is contained in the event "the
prediction is off by at least 0.1, in either direction." So

\[ P(\bar{X}_n \le 0.5) \le P(|\bar{X}_n - 0.6| \ge 0.1) \le \frac{0.6 \cdot 0.4/n}{(0.1)^2} = \frac{24}{n}. \]

For $n \ge 240$ this probability is 0.1 or smaller.
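The sample sizes in 13.4 a, b, and c all come from the same bound $1/(4na^2)$. The sketch below wraps that bound in a small helper; the function name `chebyshev_sample_size` is an illustrative choice, and exact fractions are used so that boundary cases such as $n = 250$ are not disturbed by floating-point rounding.

```python
from fractions import Fraction
from math import ceil


def chebyshev_sample_size(a, delta):
    """Smallest n with 1 / (4 n a^2) <= delta, i.e. the n for which
    Chebyshev's inequality (with p(1-p) <= 1/4) guarantees
    P(|mean - p| >= a) <= delta."""
    a, delta = Fraction(a), Fraction(delta)
    return ceil(1 / (4 * a ** 2 * delta))


n_a = chebyshev_sample_size("0.2", "0.1")    # part a
n_b = chebyshev_sample_size("0.1", "0.1")    # part b
n_c = chebyshev_sample_size("0.2", "0.05")   # part c
print(n_a, n_b, n_c)
```

Passing the numbers as strings lets `Fraction` interpret 0.1 and 0.2 exactly as 1/10 and 1/5.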

13.9 a The statement looks like the law of large numbers, and indeed, if we look
more closely, we see that $T_n$ is the average of an i.i.d. sequence: define $Y_i = X_i^2$,
then $T_n = \bar{Y}_n$. The law of large numbers now states: if $\bar{Y}_n$ is the average of $n$
independent random variables with expectation $\mu$ and variance $\sigma^2$, then for any
$\varepsilon > 0$: $\lim_{n \to \infty} P(|\bar{Y}_n - \mu| > \varepsilon) = 0$. So, if $a = \mu$ and the variance $\sigma^2$ is finite, then
it is true.

13.9 b We compute expectation and variance of $Y_i$: $E[Y_i] = E[X_i^2] = \int_{-1}^{1} \frac12 x^2\,dx =
1/3$. And: $E[Y_i^2] = E[X_i^4] = \int_{-1}^{1} \frac12 x^4\,dx = 1/5$, so $\mathrm{Var}(Y_i) = 1/5 - (1/3)^2 = 4/45$.
The variance is finite, so indeed, the law of large numbers applies, and the statement
is true if $a = E[X_i^2] = 1/3$.

14.3 First note that $P(|\bar{X}_n - p| < 0.2) = 1 - P(\bar{X}_n - p \ge 0.2) - P(\bar{X}_n - p \le -0.2)$.
Because $\mu = p$ and $\sigma^2 = p(1 - p)$, we find, using the central limit theorem:

\[ P(\bar{X}_n - p \ge 0.2) = P\left(\sqrt{n}\,\frac{\bar{X}_n - p}{\sqrt{p(1 - p)}} \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1 - p)}}\right)
 = P\left(Z_n \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1 - p)}}\right) \approx P\left(Z \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1 - p)}}\right), \]

where $Z$ has an $N(0, 1)$ distribution. Similarly,

\[ P(\bar{X}_n - p \le -0.2) \approx P\left(Z \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1 - p)}}\right), \]

so we are looking for the smallest positive integer $n$ such that

\[ 1 - 2P\left(Z \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1 - p)}}\right) \ge 0.9, \]

i.e., the smallest positive integer $n$ such that

\[ P\left(Z \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1 - p)}}\right) \le 0.05. \]

From Table B.1 it follows that

\[ \sqrt{n}\,\frac{0.2}{\sqrt{p(1 - p)}} \ge 1.645. \]

Since $p(1 - p) \le 1/4$ for all $p$ between 0 and 1, we see that $n$ should be at least 17.
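The last step of 14.3 can be reproduced without a table. The sketch below uses `statistics.NormalDist` from the Python standard library to obtain the 0.95 quantile of the standard normal distribution (about 1.645) and then takes the worst case $p(1 - p) = 1/4$; the variable names are illustrative.

```python
from math import ceil
from statistics import NormalDist

# Critical value z with P(Z >= z) = 0.05 for a standard normal Z.
z = NormalDist().inv_cdf(0.95)

margin = 0.2
worst_case_variance = 0.25  # p(1 - p) <= 1/4 for all p in [0, 1]

# sqrt(n) * margin / sqrt(p(1-p)) >= z  <=>  n >= (z / margin)^2 * p(1-p)
n_required = ceil((z / margin) ** 2 * worst_case_variance)
print(round(z, 3), n_required)
```

Compared with the Chebyshev-based answer of 63 in Exercise 13.4 a, the normal approximation asks for far fewer observations, at the price of being an approximation rather than a guaranteed bound.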

14.5 In Section 4.3 we have seen that $X$ has the same probability distribution
as $X_1 + X_2 + \cdots + X_n$, where $X_1, X_2, \ldots, X_n$ are independent $Ber(p)$ distributed
random variables. Recall that $E[X_i] = p$, and $\mathrm{Var}(X_i) = p(1 - p)$. But then we have
for any real number $a$ that

\[ P\left(\frac{X - np}{\sqrt{np(1 - p)}} \le a\right) = P\left(\frac{X_1 + X_2 + \cdots + X_n - np}{\sqrt{np(1 - p)}} \le a\right) = P(Z_n \le a), \]

see also (14.1). It follows from the central limit theorem that

\[ P\left(\frac{X - np}{\sqrt{np(1 - p)}} \le a\right) \approx \Phi(a), \]

i.e., the random variable $\frac{X - np}{\sqrt{np(1 - p)}}$ has a distribution that is approximately standard
normal.


14.9 a The probability that for a chain of at least 50 meters more than 1002 links
are needed is the same as the probability that a chain of 1002 links is shorter than
50 meters. Assuming that the random variables $X_1, X_2, \ldots, X_{1002}$ are independent,
and using the central limit theorem, we have that

\[ P(X_1 + X_2 + \cdots + X_{1002} < 5000) \approx P\left(Z < \sqrt{1002} \cdot \frac{\frac{5000}{1002} - 5}{\sqrt{0.04}}\right) = 0.0571, \]

where $Z$ has an $N(0, 1)$ distribution. So about 6% of the customers will receive a
free chain.

14.9 b We now have that

\[ P(X_1 + X_2 + \cdots + X_{1002} < 5000) \approx P(Z < 0.0032), \]

which is slightly larger than 1/2. So about half of the customers will receive a free
chain. Clearly something has to be done: a seemingly minor change of expected value
has major consequences!
15.6 Because $(2 - 0) \cdot 0.245 + (4 - 2) \cdot 0.130 + (7 - 4) \cdot 0.050 + (11 - 7) \cdot 0.020 +
(15 - 11) \cdot 0.005 = 1$, there are no data points outside the listed bins. Hence

\[ F_n(7) = \frac{\text{number of } x_i \le 7}{n} = \frac{\text{number of } x_i \text{ in bins } (0, 2], (2, 4] \text{ and } (4, 7]}{n} \]
\[ = \frac{n \cdot (2 - 0) \cdot 0.245 + n \cdot (4 - 2) \cdot 0.130 + n \cdot (7 - 4) \cdot 0.050}{n} \]
\[ = 0.490 + 0.260 + 0.150 = 0.9. \]

15.11 The height of the histogram on a bin $(a, b]$ is

\[ \frac{\text{number of } x_i \text{ in } (a, b]}{n(b - a)} = \frac{(\text{number of } x_i \le b) - (\text{number of } x_i \le a)}{n(b - a)} = \frac{F_n(b) - F_n(a)}{b - a}. \]


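15.12 a By inserting the expression for $f_{n,h}(t)$, we get

\[ \int_{-\infty}^{\infty} t \cdot f_{n,h}(t)\,dt = \int_{-\infty}^{\infty} t \cdot \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{t - x_i}{h}\right) dt = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \frac{t}{h}\,K\!\left(\frac{t - x_i}{h}\right) dt. \]

For each $i$ fixed we find with change of integration variables $u = (t - x_i)/h$,

\[ \int_{-\infty}^{\infty} \frac{t}{h}\,K\!\left(\frac{t - x_i}{h}\right) dt = \int_{-\infty}^{\infty} (x_i + hu)\,K(u)\,du = x_i \int_{-\infty}^{\infty} K(u)\,du + h \int_{-\infty}^{\infty} uK(u)\,du = x_i, \]

using that $K$ integrates to one and that $\int_{-\infty}^{\infty} uK(u)\,du = 0$, because $K$ is symmetric.
Hence

\[ \int_{-\infty}^{\infty} t \cdot f_{n,h}(t)\,dt = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \frac{t}{h}\,K\!\left(\frac{t - x_i}{h}\right) dt = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}_n. \]

15.12 b By means of similar reasoning

\[ \int_{-\infty}^{\infty} t^2 \cdot f_{n,h}(t)\,dt = \int_{-\infty}^{\infty} t^2 \cdot \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{t - x_i}{h}\right) dt = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \frac{t^2}{h}\,K\!\left(\frac{t - x_i}{h}\right) dt. \]

For each $i$:

\[ \int_{-\infty}^{\infty} \frac{t^2}{h}\,K\!\left(\frac{t - x_i}{h}\right) dt = \int_{-\infty}^{\infty} (x_i + hu)^2 K(u)\,du = \int_{-\infty}^{\infty} (x_i^2 + 2x_i hu + h^2 u^2)\,K(u)\,du \]
\[ = x_i^2 \int_{-\infty}^{\infty} K(u)\,du + 2x_i h \int_{-\infty}^{\infty} uK(u)\,du + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du \]
\[ = x_i^2 + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du, \]

again using that $K$ integrates to one and that $K$ is symmetric.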
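Both identities of 15.12 can be checked numerically for a concrete kernel. The sketch below assumes a standard normal kernel (for which $\int u^2 K(u)\,du = 1$), a small made-up dataset, and a simple midpoint rule for the integrals; none of these choices come from the exercise.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: symmetric and integrates to one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(t, data, h):
    """Kernel density estimate f_{n,h}(t) with bandwidth h."""
    return sum(gaussian_kernel((t - x) / h) for x in data) / (len(data) * h)


def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (i + 0.5) * width) for i in range(steps))


data = [1.0, 2.5, 4.0, 4.5, 7.0]  # hypothetical dataset
h = 0.8
lo, hi = min(data) - 10 * h, max(data) + 10 * h  # tails beyond are negligible

first_moment = integrate(lambda t: t * kde(t, data, h), lo, hi)
second_moment = integrate(lambda t: t * t * kde(t, data, h), lo, hi)

sample_mean = sum(data) / len(data)
mean_of_squares = sum(x * x for x in data) / len(data)
print(first_moment, sample_mean)
print(second_moment, mean_of_squares + h ** 2)
```

The first line should show two nearly equal numbers (part a), and so should the second (part b, averaged over $i$ with $\int u^2 K(u)\,du = 1$).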

16.3 a Because $n = 24$, the sample median is the average of the 12th and 13th
elements. Since these are both equal to 70, the sample median is also 70. The lower
quartile is the $p$th empirical quantile for $p = 1/4$. We get $k = \lfloor p(n + 1) \rfloor = 6$, so
that

\[ q_n(0.25) = x_{(6)} + 0.25 \cdot (x_{(7)} - x_{(6)}) = 66 + 0.25 \cdot (67 - 66) = 66.25. \]

Similarly, the upper quartile is the $p$th empirical quantile for $p = 3/4$:

\[ q_n(0.75) = x_{(18)} + 0.75 \cdot (x_{(19)} - x_{(18)}) = 75 + 0.75 \cdot (75 - 75) = 75. \]

16.3 b In part a we found the sample median and the two quartiles. From this we
compute the IQR: $q_n(0.75) - q_n(0.25) = 75 - 66.25 = 8.75$. This means that

\[ q_n(0.25) - 1.5 \cdot \mathrm{IQR} = 66.25 - 1.5 \cdot 8.75 = 53.125, \]
\[ q_n(0.75) + 1.5 \cdot \mathrm{IQR} = 75 + 1.5 \cdot 8.75 = 88.125. \]

Hence, the last element below 88.125 is 88, and the first element above 53.125 is 57.
Therefore, the upper whisker runs until 88 and the lower whisker until 57, with two
elements 53 and 31 below. This leads to the following boxplot:


16.3  c  The  values  53  and  31  are  outliers.  Value  31  is  far  away  from  the  bulk  of  the 
data  and  appears  to  be  an  extreme  outlier. 
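The empirical quantile used in 16.3 is easy to put into code. The sketch below follows the recipe visible in the solution: with $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$, the quantile is $x_{(k)} + \alpha\,(x_{(k+1)} - x_{(k)})$. The function name and the test data (the numbers 1 to 24) are illustrative and are not the dataset of the exercise.

```python
from math import floor


def empirical_quantile(data, p):
    """p-th empirical quantile by linear interpolation between the
    order statistics x_(k) and x_(k+1), with k = floor(p (n + 1))."""
    xs = sorted(data)
    n = len(xs)
    k = floor(p * (n + 1))
    alpha = p * (n + 1) - k
    if not 1 <= k < n:
        raise ValueError("p is too close to 0 or 1 for this sample size")
    # xs is 0-indexed, so x_(k) is xs[k - 1].
    return xs[k - 1] + alpha * (xs[k] - xs[k - 1])


data = list(range(1, 25))            # hypothetical dataset with n = 24
lower = empirical_quantile(data, 0.25)
upper = empirical_quantile(data, 0.75)
iqr = upper - lower
print(lower, upper, iqr)
```

With the quartiles in hand, the whisker limits are obtained as `lower - 1.5 * iqr` and `upper + 1.5 * iqr`, as in part b.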

16.6 a Yes, we find $\bar{x} = (1 + 5 + 9)/3 = 15/3 = 5$, $\bar{y} = (2 + 4 + 6 + 8)/4 = 20/4 = 5$,
so that $(\bar{x} + \bar{y})/2 = 5$. The average for the combined dataset is also equal to 5:
$(15 + 20)/7 = 5$.

16.6 b The mean of $x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_m$ equals

\[ \frac{x_1 + \cdots + x_n + y_1 + \cdots + y_m}{n + m} = \frac{n\bar{x}_n + m\bar{y}_m}{n + m} = \frac{n}{n + m}\,\bar{x}_n + \frac{m}{n + m}\,\bar{y}_m. \]

In general, this is not equal to $(\bar{x}_n + \bar{y}_m)/2$. For instance, replace 1 in the first dataset
by 4. Then $\bar{x}_n = 6$ and $\bar{y}_m = 5$, so that $(\bar{x}_n + \bar{y}_m)/2 = 5\frac12$. However, the average of
the combined dataset is $38/7 = 5\frac37$.

16.6 c Yes, $m = n$ implies $n/(n + m) = m/(n + m) = 1/2$. From the expressions
found in part b we see that the sample mean of the combined dataset equals
$(\bar{x}_n + \bar{y}_m)/2$.

16.8  The  ordered  combined  dataset  is  1,  2,  4,  5,  6,  8,  9,  so  that  the  sample  median 
equals  5.  The  absolute  deviations  from  5  are:  4,  3,  1,  0,  1,  3,  4,  and  if  we  put  them  in 
order:  0,  1,  1,  3,  3,  4,  4.  The  MAD  is  the  sample  median  of  the  absolute  deviations, 
which  is  3. 
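
As a quick check of this computation, the following sketch recomputes the sample median and the MAD of the combined dataset with Python's standard `statistics` module (the variable names are ours).

```python
from statistics import median

# Combined dataset of Exercise 16.6a: (1, 5, 9) together with (2, 4, 6, 8)
data = [1, 5, 9, 2, 4, 6, 8]

med = median(data)                               # sample median
abs_dev = sorted(abs(x - med) for x in data)     # ordered absolute deviations
mad = median(abs_dev)                            # median of absolute deviations

print(med, abs_dev, mad)  # 5 [0, 1, 1, 3, 3, 4, 4] 3
```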

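16.15 First write

\[ \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2\bar{x}_n x_i + \bar{x}_n^2\right) = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\bar{x}_n \cdot \frac{1}{n}\sum_{i=1}^{n}x_i + \frac{1}{n}\sum_{i=1}^{n}\bar{x}_n^2. \]

Next, by inserting

\[ \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}_n \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^{n}\bar{x}_n^2 = \frac{1}{n}\cdot n\cdot\bar{x}_n^2 = \bar{x}_n^2, \]

we find

\[ \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\bar{x}_n^2 + \bar{x}_n^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}_n^2. \]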

17.3  a  The  model  distribution  corresponds  to  the  number  of  women  in  a  queue.  A 
queue  has  10  positions.  The  occurrence  of  a  woman  in  any  position  is  independent 
of  the  occurrence  of  a  woman  in  other  positions.  At  each  position  a  woman  occurs 
with  probability  p.  Counting  the  occurrence  of  a  woman  as  a  “success,”  the  number 
of  women  in  a  queue  corresponds  to  the  number  of  successes  in  10  independent 
experiments with probability p of success and is therefore modeled by a Bin(10, p)
distribution.


17.3 b We have 100 queues and the number of women $x_i$ in the ith queue is a
realization of a Bin(10, p) random variable. Hence, according to Table 17.2, the
average number of women $\bar{x}_{100}$ resembles the expectation 10p of the Bin(10, p)
distribution. We find $\bar{x}_{100} = 435/100 = 4.35$, so an estimate for p is 4.35/10 = 0.435.

17.7 a If we model the series of disasters by a Poisson process, then as a property of
the Poisson process, the interdisaster times should follow an exponential distribution
(see Section 12.3). This is indeed confirmed by the histogram and empirical
distribution function of the observed interdisaster times; they resemble the probability
density and distribution function of an exponential distribution.

17.7 b The average length of a time interval is 40 549/190 = 213.4 days. Following
Table 17.2 this should resemble the expectation of the Exp(λ) distribution, which
is 1/λ. Hence, as an estimate for λ we could take 190/40 549 = 0.00469.

17.9 a A (perfect) cylindrical cone with diameter d (at the base) and height h has
volume $\pi d^2 h/12$, or about $0.26\,d^2 h$. The effective wood of a tree is the trunk without
the branches. Since the trunk is similar to a cylindrical cone, one can expect a linear
relation between the effective wood and $d^2 h$.

17.9 b We find

\[ \overline{y/x} = \frac{1}{n}\sum \frac{y_i}{x_i} = \frac{9.369}{31} = 0.3022, \]
\[ \bar{y}/\bar{x} = \frac{(\sum y_i)/n}{(\sum x_i)/n} = \frac{26.486/31}{87.456/31} = 0.3028, \]
\[ \text{least squares} = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{95.498}{314.644} = 0.3035. \]


18.3 a Note that generating from the empirical distribution function is the same as
choosing one of the elements of the original dataset with equal probability. Hence,
an element in the bootstrap dataset equals 0.35 with probability 0.1. The number
of ways to have exactly three out of ten elements equal to 0.35 is $\binom{10}{3}$, and each has
probability $(0.1)^3 (0.9)^7$. Therefore, the probability that the bootstrap dataset has
exactly three elements equal to 0.35 is equal to $\binom{10}{3}(0.1)^3(0.9)^7 = 0.0574$.

18.3 b Having at most two elements less than or equal to 0.38 means that 0, 1,
or 2 elements are less than or equal to 0.38. Five elements of the original dataset
are smaller than or equal to 0.38, so that an element in the bootstrap dataset is
less than or equal to 0.38 with probability 0.5. Hence, the probability that the
bootstrap dataset has at most two elements less than or equal to 0.38 is equal to
$(0.5)^{10} + \binom{10}{1}(0.5)^{10} + \binom{10}{2}(0.5)^{10} = 0.0547$.

18.3 c Five elements of the dataset are smaller than or equal to 0.38 and two
are greater than 0.42. Therefore, obtaining a bootstrap dataset with two elements
less than or equal to 0.38, and the other elements greater than 0.42 has probability
$(0.5)^2 (0.2)^8$. The number of such bootstrap datasets is $\binom{10}{2}$. So the answer is
$\binom{10}{2}(0.5)^2 (0.2)^8 = 0.000029$.
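
The three probabilities of Exercise 18.3 are plain binomial-type computations, so they are easy to verify numerically. The sketch below uses only `math.comb` from the standard library; the helper `binom_pmf` and the variable names are ours.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X with a Bin(n, p) distribution."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# a: exactly three of the ten bootstrap elements equal 0.35 (probability 0.1 each)
p_a = binom_pmf(3, 10, 0.1)

# b: at most two of the ten elements are <= 0.38 (probability 0.5 each)
p_b = sum(binom_pmf(k, 10, 0.5) for k in range(3))

# c: two elements <= 0.38 (prob. 0.5) and the other eight > 0.42 (prob. 0.2)
p_c = comb(10, 2) * 0.5**2 * 0.2**8

print(round(p_a, 4), round(p_b, 4), round(p_c, 6))  # 0.0574 0.0547 2.9e-05
```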

18.7 For the parametric bootstrap, we must estimate the parameter θ by $\hat\theta = (n + 1)m_n/n$,
and generate bootstrap samples from the $U(0, \hat\theta)$ distribution. This
distribution has expectation $\mu_{\hat\theta} = \hat\theta/2 = (n + 1)m_n/(2n)$. Hence, for each bootstrap
sample $x_1^*, x_2^*, \ldots, x_n^*$ compute $\bar{x}_n^* - \mu_{\hat\theta} = \bar{x}_n^* - (n + 1)m_n/(2n)$.

Note that this is different from the empirical bootstrap simulation, where one would
estimate μ by $\bar{x}_n$ and compute $\bar{x}_n^* - \bar{x}_n$.

18.8 a Since we know nothing about the distribution of the interfailure times, we
estimate F by the empirical distribution function $F_n$ of the software data and we
estimate the expectation μ of F by the expectation $\mu^* = \bar{x}_n = 656.8815$ of $F_n$.
The bootstrapped centered sample mean is the random variable $\bar{X}_n^* - 656.8815$. The
corresponding empirical bootstrap simulation is described as follows:

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$, i.e., draw with replacement
135 numbers from the software data.

2. Compute the centered sample mean for the bootstrap dataset:

\[ \bar{x}_n^* - 656.8815, \]

where $\bar{x}_n^*$ is the sample mean of $x_1^*, x_2^*, \ldots, x_n^*$.

Repeat steps 1 and 2 one thousand times.

18.8 b Because the interfailure times are now assumed to have an Exp(λ) distribution,
we must estimate λ by $\hat\lambda = 1/\bar{x}_n = 0.0015$ and estimate F by the distribution
function of the Exp(0.0015) distribution. Estimate the expectation μ = 1/λ of the
Exp(λ) distribution by $\mu^* = 1/\hat\lambda = \bar{x}_n = 656.8815$. Also now, the bootstrapped
centered sample mean is the random variable $\bar{X}_n^* - 656.8815$. The corresponding
parametric bootstrap simulation is described as follows:

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from the Exp(0.0015) distribution.

2. Compute the centered sample mean for the bootstrap dataset:

\[ \bar{x}_n^* - 656.8815, \]

where $\bar{x}_n^*$ is the sample mean of $x_1^*, x_2^*, \ldots, x_n^*$.

Repeat steps 1 and 2 one thousand times. We see that in this simulation the
bootstrapped centered sample mean is the same in both cases: $\bar{X}_n^* - \bar{x}_n$, but the
corresponding simulation procedures differ in step 1.

18.8 c Estimate λ by $\hat\lambda = \ln 2/m_n = 0.0024$ and estimate F by the distribution
function of the Exp(0.0024) distribution. Estimate the expectation μ = 1/λ of the
Exp(λ) distribution by $\mu^* = 1/\hat\lambda = 418.3816$. The corresponding parametric
bootstrap simulation is described as follows:

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from the Exp(0.0024) distribution.

2. Compute the centered sample mean for the bootstrap dataset:

\[ \bar{x}_n^* - 418.3816, \]

where $\bar{x}_n^*$ is the sample mean of $x_1^*, x_2^*, \ldots, x_n^*$.

Repeat steps 1 and 2 one thousand times. We see that in this parametric bootstrap
simulation the bootstrapped centered sample mean is different from the one in the
empirical bootstrap simulation: $\bar{X}_n^* - m_n/\ln 2$ (that is, $\bar{X}_n^* - 1/\hat\lambda$) instead of $\bar{X}_n^* - \bar{x}_n$.
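
The three simulation procedures of Exercise 18.8 differ only in how a bootstrap dataset is generated and in which estimate of the expectation is subtracted. The following Python sketch illustrates this under stated assumptions: the software data are not reproduced in this solution, so the demonstration uses synthetic exponential data, and the function names are hypothetical.

```python
import math
import random
from statistics import mean, median

def empirical_bootstrap(data, reps=1000, rng=random):
    """Centered sample means from resampling the data with replacement."""
    n, xbar = len(data), mean(data)
    return [mean(rng.choices(data, k=n)) - xbar for _ in range(reps)]

def parametric_bootstrap_exp(data, reps=1000, use_median=False, rng=random):
    """Centered sample means from an Exp(lambda-hat) model.

    lambda-hat is 1 / (sample mean), or ln(2) / (sample median) if use_median.
    """
    n = len(data)
    lam = math.log(2) / median(data) if use_median else 1 / mean(data)
    mu_star = 1 / lam  # expectation of the fitted exponential distribution
    return [mean(rng.expovariate(lam) for _ in range(n)) - mu_star
            for _ in range(reps)]

# Synthetic stand-in for the 135 interfailure times (NOT the software data)
rng = random.Random(1)
fake_data = [rng.expovariate(1 / 650) for _ in range(135)]
emp = empirical_bootstrap(fake_data, reps=200, rng=rng)
par = parametric_bootstrap_exp(fake_data, reps=200, rng=rng)
par_med = parametric_bootstrap_exp(fake_data, reps=200, use_median=True, rng=rng)
```

Each list holds realizations of the bootstrapped centered sample mean; their empirical distribution can then be used to approximate the distribution of the centered sample mean.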

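19.1a From the formulas for the expectation and variance of uniform random
variables we know that $E[X_i] = 0$ and $\mathrm{Var}(X_i) = (2\theta)^2/12 = \theta^2/3$. Hence
$E[X_i^2] = \mathrm{Var}(X_i) + (E[X_i])^2 = \theta^2/3$. Therefore, by linearity of expectations

\[ E[T] = \frac{3}{n}\left(E[X_1^2] + \cdots + E[X_n^2]\right) = \frac{3}{n} \cdot n \cdot \frac{\theta^2}{3} = \theta^2. \]

Since $E[T] = \theta^2$, the random variable T is an unbiased estimator for $\theta^2$.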

19.1b The function $g(x) = -\sqrt{x}$ is a strictly convex function, because
$g''(x) = x^{-3/2}/4 > 0$. Therefore, by Jensen's inequality, $-\sqrt{E[T]} < -E[\sqrt{T}]$. Since, from
part a we know that $E[T] = \theta^2$, this means that $E[\sqrt{T}] < \theta$. In other words, $\sqrt{T}$
is a biased estimator for θ, with negative bias.

19.8 From the model assumptions it follows that $E[Y_i] = \beta x_i$ for each i. Using
linearity of expectations, this implies that

\[ E[B_1] = \frac{1}{n}\left(\frac{E[Y_1]}{x_1} + \cdots + \frac{E[Y_n]}{x_n}\right) = \frac{1}{n}\left(\frac{\beta x_1}{x_1} + \cdots + \frac{\beta x_n}{x_n}\right) = \beta, \]
\[ E[B_2] = \frac{E[Y_1] + \cdots + E[Y_n]}{x_1 + \cdots + x_n} = \frac{\beta x_1 + \cdots + \beta x_n}{x_1 + \cdots + x_n} = \beta, \]
\[ E[B_3] = \frac{x_1 E[Y_1] + \cdots + x_n E[Y_n]}{x_1^2 + \cdots + x_n^2} = \frac{\beta x_1^2 + \cdots + \beta x_n^2}{x_1^2 + \cdots + x_n^2} = \beta. \]

20.2 a Compute the mean squared errors of S and T: $\mathrm{MSE}(S) = \mathrm{Var}(S) + [\mathrm{bias}(S)]^2 = 40 + 0 = 40$;
$\mathrm{MSE}(T) = \mathrm{Var}(T) + [\mathrm{bias}(T)]^2 = 4 + 9 = 13$. We prefer T,
because it has a smaller MSE.

20.2 b Compute the mean squared errors of S and T: MSE(S) = 40, as in a;
$\mathrm{MSE}(T) = \mathrm{Var}(T) + [\mathrm{bias}(T)]^2 = 4 + a^2$. So, if $|a| < 6$ (that is, $4 + a^2 < 40$): prefer T.
If $|a| > 6$: prefer S. The preferences are based on the MSE criterion.

20.3 $\mathrm{Var}(T_1) = 1/(n\lambda^2)$, $\mathrm{Var}(T_2) = 1/\lambda^2$; hence we prefer $T_1$, because of its smaller
variance.

20.8 a This follows directly from linearity of expectations:

\[ E[T] = E[r\bar{X}_n + (1 - r)\bar{Y}_m] = rE[\bar{X}_n] + (1 - r)E[\bar{Y}_m] = r\mu + (1 - r)\mu = \mu. \]

20.8 b Using that $\bar{X}_n$ and $\bar{Y}_m$ are independent, we find $\mathrm{MSE}(T) = \mathrm{Var}(T) =
r^2\mathrm{Var}(\bar{X}_n) + (1 - r)^2\mathrm{Var}(\bar{Y}_m) = r^2/n + (1 - r)^2/m$.

To find the minimum of this parabola we differentiate with respect to r and
equate the result to 0: $2r/n - 2(1 - r)/m = 0$. This gives the minimizing value of r:
$2rm - 2n(1 - r) = 0$, or $r = n/(n + m)$.

21.1 Setting $X_i = j$ if red appears in the ith experiment for the first time on the
jth throw, we have that $X_1$, $X_2$, and $X_3$ are independent Geo(p) distributed random
variables, where p is the probability that red appears when throwing the selected
die. The likelihood function is

\[ L(p) = P(X_1 = 3, X_2 = 5, X_3 = 4) = (1 - p)^2 p \cdot (1 - p)^4 p \cdot (1 - p)^3 p = p^3 (1 - p)^9, \]

so for $D_1$ one has that $L(p) = L(\tfrac{5}{6}) = (\tfrac{5}{6})^3 (1 - \tfrac{5}{6})^9$, whereas for $D_2$ one has that
$L(p) = L(\tfrac{1}{6}) = (\tfrac{1}{6})^3 (1 - \tfrac{1}{6})^9 = 5^6 \cdot L(\tfrac{5}{6})$. It is very likely that we picked $D_2$.
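
The factor $5^6$ between the two likelihoods can be confirmed with exact rational arithmetic. In the sketch below the values p = 5/6 for $D_1$ and p = 1/6 for $D_2$ are read off from the solution above; the function name `likelihood` is ours.

```python
from fractions import Fraction

def likelihood(p, throws=(3, 5, 4)):
    """Geometric likelihood: product of (1 - p)^(j - 1) * p over the observed j."""
    value = Fraction(1)
    for j in throws:
        value *= (1 - p) ** (j - 1) * p
    return value

L_d1 = likelihood(Fraction(5, 6))   # die D1: red appears with probability 5/6
L_d2 = likelihood(Fraction(1, 6))   # die D2: red appears with probability 1/6

print(L_d2 / L_d1)  # 15625
```

Since 15625 = $5^6$, the observed throws are far more likely under $D_2$ than under $D_1$.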

21.4 a The likelihood L(μ) is given by

\[ L(\mu) = P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n) \]
\[ = \frac{\mu^{x_1}}{x_1!}e^{-\mu} \cdots \frac{\mu^{x_n}}{x_n!}e^{-\mu} = \frac{e^{-n\mu}}{x_1! \cdots x_n!}\,\mu^{x_1 + x_2 + \cdots + x_n}. \]

21.4 b We find that the loglikelihood ℓ(μ) is given by

\[ \ell(\mu) = \left(\sum_{i=1}^{n} x_i\right)\ln(\mu) - \ln(x_1! \cdots x_n!) - n\mu. \]

Hence

\[ \frac{d\ell}{d\mu} = \frac{\sum x_i}{\mu} - n, \]

and we find (after checking that we indeed have a maximum!) that $\bar{x}_n$ is the
maximum likelihood estimate for μ.

21.4 c In b we have seen that $\bar{x}_n$ is the maximum likelihood estimate for μ. Due to
the invariance principle from Section 21.4 we thus find that $e^{-\bar{x}_n}$ is the maximum
likelihood estimate for $e^{-\mu}$.




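21.8 a The likelihood L(θ) is given by

\[ L(\theta) = C \cdot \left(\tfrac{1}{4}(2 + \theta)\right)^{1997} \cdot \left(\tfrac{1}{4}\theta\right)^{32} \cdot \left(\tfrac{1}{4}(1 - \theta)\right)^{906} \cdot \left(\tfrac{1}{4}(1 - \theta)\right)^{904} = \frac{C}{4^{3839}} \cdot (2 + \theta)^{1997} \cdot \theta^{32} \cdot (1 - \theta)^{1810}, \]

where C is the number of ways we can assign 1997 starchy-greens, 32 sugary-whites,
906 starchy-whites, and 904 sugary-greens to 3839 plants. Hence the loglikelihood
ℓ(θ) is given by

\[ \ell(\theta) = \ln(C) - 3839\ln(4) + 1997\ln(2 + \theta) + 32\ln(\theta) + 1810\ln(1 - \theta). \]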


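21.8 b A short calculation shows that

\[ \frac{d\ell(\theta)}{d\theta} = \frac{1997}{2 + \theta} + \frac{32}{\theta} - \frac{1810}{1 - \theta} = 0 \iff 3839\,\theta^2 + 1655\,\theta - 64 = 0, \]

so the maximum likelihood estimate of θ is (after checking that L(θ) indeed attains
a maximum for this value of θ):

\[ \hat\theta = \frac{-1655 + \sqrt{3721809}}{7678} = 0.0357. \]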


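21.8 c In this general case the likelihood L(θ) is given by

\[ L(\theta) = C \cdot \left(\tfrac{1}{4}(2 + \theta)\right)^{n_1} \cdot \left(\tfrac{1}{4}\theta\right)^{n_2} \cdot \left(\tfrac{1}{4}(1 - \theta)\right)^{n_3} \cdot \left(\tfrac{1}{4}(1 - \theta)\right)^{n_4} = \frac{C}{4^{n}} \cdot (2 + \theta)^{n_1} \cdot \theta^{n_2} \cdot (1 - \theta)^{n_3 + n_4}, \]

where C is the number of ways we can assign $n_1$ starchy-greens, $n_2$ sugary-whites,
$n_3$ starchy-whites, and $n_4$ sugary-greens to n plants. Hence the loglikelihood ℓ(θ) is
given by

\[ \ell(\theta) = \ln(C) - n\ln(4) + n_1\ln(2 + \theta) + n_2\ln(\theta) + (n_3 + n_4)\ln(1 - \theta). \]

A short calculation shows that

\[ \frac{d\ell(\theta)}{d\theta} = 0 \iff n\theta^2 - (n_1 - n_2 - 2n_3 - 2n_4)\theta - 2n_2 = 0, \]

so the maximum likelihood estimate of θ is (after checking that L(θ) indeed attains
a maximum for this value of θ):

\[ \hat\theta = \frac{n_1 - n_2 - 2n_3 - 2n_4 + \sqrt{(n_1 - n_2 - 2n_3 - 2n_4)^2 + 8nn_2}}{2n}. \]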
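
As a numerical sanity check, the general formula of part c can be evaluated for the counts of part a and compared with the loglikelihood at nearby values of θ. This is a sketch only; the function names `theta_mle` and `loglik` are ours, and the constant terms ln(C) and n ln(4) are dropped because they do not depend on θ.

```python
import math

def theta_mle(n1, n2, n3, n4):
    """Positive root of n*t^2 - (n1 - n2 - 2*n3 - 2*n4)*t - 2*n2 = 0."""
    n = n1 + n2 + n3 + n4
    b = n1 - n2 - 2 * n3 - 2 * n4
    return (b + math.sqrt(b * b + 8 * n * n2)) / (2 * n)

def loglik(theta, n1, n2, n3, n4):
    """Loglikelihood up to an additive constant that does not depend on theta."""
    return (n1 * math.log(2 + theta) + n2 * math.log(theta)
            + (n3 + n4) * math.log(1 - theta))

counts = (1997, 32, 906, 904)   # starchy-green, sugary-white, starchy-white, sugary-green
theta_hat = theta_mle(*counts)
print(round(theta_hat, 4))  # 0.0357
```

Evaluating `loglik` slightly to the left and right of `theta_hat` gives smaller values, in line with the claim that the root is a maximum.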

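21.11a Since the dataset is a realization of a random sample from a Geo(1/N)
distribution, the likelihood is $L(N) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$, where
each $X_i$ has a Geo(1/N) distribution. So

\[ L(N) = \left(1 - \frac{1}{N}\right)^{x_1 - 1}\frac{1}{N} \cdot \left(1 - \frac{1}{N}\right)^{x_2 - 1}\frac{1}{N} \cdots \left(1 - \frac{1}{N}\right)^{x_n - 1}\frac{1}{N} = \left(1 - \frac{1}{N}\right)^{-n + \sum_{i=1}^{n} x_i}\left(\frac{1}{N}\right)^{n}. \]

But then the loglikelihood is equal to

\[ \ell(N) = -n\ln N + \left(-n + \sum_{i=1}^{n} x_i\right)\ln\left(1 - \frac{1}{N}\right). \]

Differentiating with respect to N yields

\[ \frac{d}{dN}\,\ell(N) = -\frac{n}{N} + \left(-n + \sum_{i=1}^{n} x_i\right)\frac{1}{N(N - 1)}. \]

Now $\frac{d}{dN}\,\ell(N) = 0$ if and only if $N = \bar{x}_n$. Because ℓ(N) attains its maximum at
$\bar{x}_n$, we find that the maximum likelihood estimate of N is $\hat{N} = \bar{x}_n$.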

21.11 b Since $P(Y = k) = 1/N$ for $k = 1, 2, \ldots, N$, the likelihood is given by

\[ L(N) = \left(\frac{1}{N}\right)^{n} \quad \text{for } N \ge y_{(n)}, \]

and $L(N) = 0$ for $N < y_{(n)}$. So L(N) attains its maximum at $y_{(n)}$; the maximum
likelihood estimate of N is $\hat{N} = y_{(n)}$.

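22.1a Since $\sum x_i y_i = 12.4$, $\sum x_i = 9$, $\sum y_i = 4.8$, $\sum x_i^2 = 35$, and n = 3, we find
(cf. (22.1) and (22.2)), that

\[ \hat\beta = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} = \frac{3 \cdot 12.4 - 9 \cdot 4.8}{3 \cdot 35 - 9^2} = -\frac{1}{4}, \]

and $\hat\alpha = \bar{y}_n - \hat\beta\bar{x}_n = 2.35$.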

22.1 b Since $r_i = y_i - \hat\alpha - \hat\beta x_i$, for $i = 1, \ldots, n$, we find $r_1 = 2 - 2.35 + 0.25 = -0.1$,
$r_2 = 1.8 - 2.35 + 0.75 = 0.2$, $r_3 = 1 - 2.35 + 1.25 = -0.1$, and $r_1 + r_2 + r_3 =
-0.1 + 0.2 - 0.1 = 0$.
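
The estimates and residuals of Exercise 22.1 can be reproduced with a short Python sketch. The data points (1, 2), (3, 1.8), (5, 1) used here are an assumption: they are inferred from the residual computations in part b and agree with the sums quoted in part a.

```python
# Data inferred from the residual computations in part b (an assumption)
x = [1, 3, 5]
y = [2, 1.8, 1]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_xx = sum(xi * xi for xi in x)

# Least squares estimates for slope and intercept
beta_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x**2)
alpha_hat = sum_y / n - beta_hat * sum_x / n

residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
print(round(beta_hat, 4), round(alpha_hat, 4), [round(r, 4) for r in residuals])
```

Up to floating-point rounding, this gives slope -0.25, intercept 2.35, and residuals -0.1, 0.2, -0.1, which sum to zero.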

22.1c See Figure D.1.

[Fig. D.1. Solution of Exercise 22.1c.]



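22.5 With the assumption that α = 0, the method of least squares tells us now to
minimize

\[ S(\beta) = \sum_{i=1}^{n}(y_i - \beta x_i)^2. \]

Now

\[ \frac{dS(\beta)}{d\beta} = -2\sum_{i=1}^{n}(y_i - \beta x_i)x_i = -2\left(\sum_{i=1}^{n} x_i y_i - \beta\sum_{i=1}^{n} x_i^2\right), \]

so that

\[ \frac{dS(\beta)}{d\beta} = 0 \iff \beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. \]

Because S(β) has a minimum for this last value of β, we see that the least squares
estimator $\hat\beta$ of β is given by

\[ \hat\beta = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2}. \]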


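22.12 a Since the denominator of $\hat\beta$ is a number, not a random variable, one has
that

\[ E[\hat\beta] = \frac{E\left[n(\sum x_i Y_i) - (\sum x_i)(\sum Y_i)\right]}{n\sum x_i^2 - (\sum x_i)^2}. \]

Furthermore, the numerator of this last fraction can be written as

\[ E\left[n\sum x_i Y_i\right] - E\left[(\sum x_i)(\sum Y_i)\right], \]

which is equal to

\[ n\sum(x_i E[Y_i]) - (\sum x_i)\sum E[Y_i]. \]

22.12 b Substituting $E[Y_i] = \alpha + \beta x_i$ in the last expression, we find that

\[ E[\hat\beta] = \frac{n\sum(x_i(\alpha + \beta x_i)) - (\sum x_i)\left[\sum(\alpha + \beta x_i)\right]}{n\sum x_i^2 - (\sum x_i)^2}. \]

22.12 c The numerator of the previous expression for $E[\hat\beta]$ can be simplified to

\[ n\alpha\sum x_i + n\beta\sum x_i^2 - n\alpha\sum x_i - \beta(\sum x_i)(\sum x_i), \]

which is equal to

\[ \beta\left(n\sum x_i^2 - (\sum x_i)^2\right). \]

22.12 d From c it now follows that

\[ E[\hat\beta] = \frac{\beta\left(n\sum x_i^2 - (\sum x_i)^2\right)}{n\sum x_i^2 - (\sum x_i)^2} = \beta, \]

i.e., $\hat\beta$ is an unbiased estimator for β.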

23.5 a The standard confidence interval for the mean of a normal sample with
unknown variance applies, with n = 23, $\bar{x} = 0.82$ and s = 1.78, so:

\[ \left(\bar{x} - t_{22,0.025} \cdot \frac{s}{\sqrt{23}},\ \bar{x} + t_{22,0.025} \cdot \frac{s}{\sqrt{23}}\right). \]

The critical values come from the t(22) distribution: $t_{22,0.025} = 2.074$. The actual
interval becomes:

\[ \left(0.82 - 2.074 \cdot \frac{1.78}{\sqrt{23}},\ 0.82 + 2.074 \cdot \frac{1.78}{\sqrt{23}}\right) = (0.050,\ 1.590). \]

23.5 b Generate one thousand samples of size 23, by drawing with replacement
from the 23 numbers

1.06, 1.04, 2.62, ..., 2.01.

For each sample $x_1^*, x_2^*, \ldots, x_{23}^*$ compute: $t^* = (\bar{x}_{23}^* - 0.82)/(s_{23}^*/\sqrt{23})$, where
$s_{23}^* = \sqrt{\tfrac{1}{22}\sum (x_i^* - \bar{x}_{23}^*)^2}$.

23.5 c We need to estimate the critical value $c_l^*$ such that $P(T^* \le c_l^*) \approx 0.025$. We
take $c_l^* = -2.101$, the 25th of the ordered values, an estimate for the 25/1000 = 0.025
quantile. Similarly, $c_u^*$ is estimated by the 976th, which is 2.088.

The bootstrap confidence interval uses the $c^*$ values instead of the t-distribution
values $\pm t_{n-1,\alpha/2}$, but beware: $c_l^*$ is from the left tail and appears on the right-hand
side of the interval and $c_u^*$ on the left-hand side:

\[ \left(\bar{x}_n - c_u^* \cdot \frac{s_n}{\sqrt{n}},\ \bar{x}_n - c_l^* \cdot \frac{s_n}{\sqrt{n}}\right). \]

Substituting $c_l^* = -2.101$ and $c_u^* = 2.088$, the confidence interval becomes:

\[ \left(0.82 - 2.088 \cdot \frac{1.78}{\sqrt{23}},\ 0.82 + 2.101 \cdot \frac{1.78}{\sqrt{23}}\right) = (0.045,\ 1.600). \]
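
Both intervals of Exercise 23.5 have the same shape and differ only in the critical values used. The sketch below recomputes them from the numbers quoted in the solution; the helper `interval` is ours, and the critical values are taken as given (2.074 from the t-table, and -2.101 and 2.088 from the bootstrap experiment), not recomputed.

```python
import math

n, xbar, s = 23, 0.82, 1.78
se = s / math.sqrt(n)  # estimated standard error of the sample mean

def interval(c_low, c_up):
    """(xbar - c_up * se, xbar - c_low * se): note the interchange of the tails."""
    return xbar - c_up * se, xbar - c_low * se

t_interval = interval(-2.074, 2.074)          # classical t(22) interval
boot_interval = interval(-2.101, 2.088)       # bootstrap critical values

print("(%.3f, %.3f)" % t_interval)     # (0.050, 1.590)
print("(%.3f, %.3f)" % boot_interval)  # (0.045, 1.600)
```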

23.6 a Because events described by inequalities do not change when we multiply
the inequalities by a positive constant or add or subtract a constant, the
following equalities hold: $P(\tilde{L}_n < \theta < \tilde{U}_n) = P(3L_n + 7 < 3\mu + 7 < 3U_n + 7) =
P(3L_n < 3\mu < 3U_n) = P(L_n < \mu < U_n)$, and this equals 0.95, as is given.

23.6 b The confidence interval for θ is obtained as the realization of $(\tilde{L}_n, \tilde{U}_n)$, that
is: $(\tilde{l}_n, \tilde{u}_n) = (3l_n + 7,\ 3u_n + 7)$. This is obtained by transforming the confidence
interval for μ (using the transformation that is applied to μ to get θ).

23.6 c We start with $P(L_n < \mu < U_n) = 0.95$ and try to get $1 - \mu$ in the middle:
$P(L_n < \mu < U_n) = P(-L_n > -\mu > -U_n) = P(1 - L_n > 1 - \mu > 1 - U_n) =
P(1 - U_n < 1 - \mu < 1 - L_n)$, where we see that the minus sign causes an
interchange: $\tilde{L}_n = 1 - U_n$ and $\tilde{U}_n = 1 - L_n$. The confidence interval: $(1 - 5,\ 1 - (-2)) =
(-4, 3)$.

23.6 d If we knew that $L_n$ and $U_n$ were always positive, then we could conclude:
$P(L_n < \mu < U_n) = P(L_n^2 < \mu^2 < U_n^2)$ and we could just square the numbers in the
confidence interval for μ to get the one for θ. Without the positivity assumption, the
sharpest conclusion you can draw from $L_n < \mu < U_n$ is that $\mu^2$ is smaller than the
maximum of $L_n^2$ and $U_n^2$. So, $0.95 = P(L_n < \mu < U_n) \le P(0 \le \mu^2 < \max\{L_n^2, U_n^2\})$
and the confidence interval $[0, \max\{l_n^2, u_n^2\}) = [0, 25)$ has a confidence of at least
95%. This kind of problem may occur when the transformation is not one-to-one
(both -1 and 1 are mapped to 1 by squaring).

23.11a For the 98% confidence interval the same formula is used as for the 95%
interval, replacing the critical values by larger ones. This is the case, no matter
whether the critical values are from the normal or t-distribution, or from a bootstrap
experiment. Therefore, the 98% interval contains the 95%, and so must also contain
the number 0.

23.11b From a new bootstrap experiment we would obtain new and, most probably,
different values $c_l^*$ and $c_u^*$. It therefore could be, if the number 0 is close to
the edge of the first bootstrap confidence interval, that it is just outside the new
interval.

23.11  c  The  new  dataset  will  resemble  the  old  one  in  many  ways,  but  things  like  the 
sample  mean  would  most  likely  differ  from  the  old  one,  and  so  there  is  no  guarantee 
that  the  number  0  will  again  be  in  the  confidence  interval. 

24.6 a The environmentalists are interested in a lower confidence bound, because
they would like to make a statement like "We are 97.5% confident that the
concentration exceeds 1.68 ppm [and that is much too high.]" We have normal data,
with σ unknown so we use $s_{16} = \sqrt{1.12} = 1.058$ as an estimate and use the critical
value corresponding to 2.5% from the t(15) distribution: $t_{15,0.025} = 2.131$. The
lower confidence bound is $2.24 - 2.131 \cdot 1.058/\sqrt{16} = 2.24 - 0.56 = 1.68$, the interval:
$(1.68, \infty)$.

24.6 b For similar reasons, the plant management constructs an upper confidence
bound ("We are 97.5% confident pollution does not exceed 2.80 [and this is
acceptable.]"). The computation is the same except for a minus sign:
$2.24 + 2.131 \cdot 1.058/\sqrt{16} = 2.24 + 0.56 = 2.80$, so the interval is [0, 2.80). Note that the computed
upper and lower bounds are in fact the endpoints of the 95% two-sided confidence
interval.

24.9a From Section 8.4 we know: $P(M \le a) = [F_X(a)]^{12}$, so $P(M/\theta \le t) =
P(M \le \theta t) = [F_X(\theta t)]^{12}$. Since $X_i$ has a $U(0, \theta)$ distribution, $F_X(\theta t) = t$, for
$0 \le t \le 1$. Substituting this shows the result.

24.9 b For $c_l$ we need to solve $(c_l)^{12} = \alpha/2$, or $c_l = (\alpha/2)^{1/12} = (0.05)^{1/12} = 0.7791$.
For $c_u$ we need to solve $(c_u)^{12} = 1 - \alpha/2$, or $c_u = (1 - \alpha/2)^{1/12} = (0.95)^{1/12} = 0.9958$.

24.9 c From b we know that $P(c_l < M/\theta < c_u) = P(0.7791 < M/\theta < 0.9958) =
0.90$. Rewriting this equation, we get: $P(0.7791\,\theta < M < 0.9958\,\theta) = 0.90$ and
$P(M/0.9958 < \theta < M/0.7791) = 0.90$. This means that $(m/0.9958,\ m/0.7791) =
(3.013,\ 3.851)$ is a 90% confidence interval for θ.

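24.9 d From b we derive the general formula:

\[ P\left((\alpha/2)^{1/n} < \frac{M}{\theta} < (1 - \alpha/2)^{1/n}\right) = 1 - \alpha. \]

The left-hand inequality can be rewritten as $\theta < M/(\alpha/2)^{1/n}$ and the right-hand
one as $M/(1 - \alpha/2)^{1/n} < \theta$. So, the statement above can be rewritten as:

\[ P\left(\frac{M}{(1 - \alpha/2)^{1/n}} < \theta < \frac{M}{(\alpha/2)^{1/n}}\right) = 1 - \alpha, \]

so that the general formula for the confidence interval becomes:

\[ \left(\frac{m}{(1 - \alpha/2)^{1/n}},\ \frac{m}{(\alpha/2)^{1/n}}\right). \]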
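
The general formula translates directly into a small function. In the sketch below the name `uniform_max_ci` is ours, and the observed maximum m = 3 is an assumption inferred from the interval (3.013, 3.851) reported in part c; n = 12 and α = 0.10 are the values used in parts a to c.

```python
def uniform_max_ci(m, n, alpha):
    """Confidence interval for theta of U(0, theta), based on the sample maximum m."""
    lower = m / (1 - alpha / 2) ** (1 / n)
    upper = m / (alpha / 2) ** (1 / n)
    return lower, upper

# m = 3.0 is assumed here; it reproduces the interval reported in part c
low, high = uniform_max_ci(m=3.0, n=12, alpha=0.10)
print(round(low, 3), round(high, 3))  # 3.013 3.851
```

Note that the lower endpoint always exceeds m, as it should: θ cannot be smaller than the largest observation.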

25.4 a Denote the observed numbers of cycles for the smokers by $X_1, X_2, \ldots, X_{n_1}$
and similarly $Y_1, Y_2, \ldots, Y_{n_2}$ for the nonsmokers. A test statistic should compare
estimators for $p_1$ and $p_2$. Since the geometric distributions have expectations $1/p_1$
and $1/p_2$, we could compare the estimator $1/\bar{X}_{n_1}$ for $p_1$ with the estimator $1/\bar{Y}_{n_2}$ for
$p_2$, or simply compare $\bar{X}_{n_1}$ with $\bar{Y}_{n_2}$. For instance, take test statistic $T = \bar{X}_{n_1} - \bar{Y}_{n_2}$.

Values of T close to zero are in favor of $H_0$, and values far away from zero are in
favor of $H_1$. Another possibility is $T = \bar{X}_{n_1}/\bar{Y}_{n_2}$.

25.4 b In this case, the maximum likelihood estimators $\hat{p}_1$ and $\hat{p}_2$ give better
indications about $p_1$ and $p_2$. They can be compared in the same way as the estimators
in a.

25.4 c The probability of getting pregnant during a cycle is $p_1$ for the smoking
women and $p_2$ for the nonsmokers. The alternative hypothesis should express the
belief that smoking women are less likely to get pregnant than nonsmoking women.
Therefore take $H_1 : p_1 < p_2$.

25.10 a The alternative hypothesis should express the belief that the gross calorific
value exceeds 23.75 MJ/kg. Therefore take $H_1 : \mu > 23.75$.

25.10 b The p-value is the probability $P(\bar{X}_n \ge 23.788)$ under the null hypothesis.
We can compute this probability by using that under the null hypothesis $\bar{X}_n$ has an
$N(23.75, (0.1)^2/23)$ distribution:

\[ P(\bar{X}_n \ge 23.788) = P\left(\frac{\bar{X}_n - 23.75}{0.1/\sqrt{23}} \ge \frac{23.788 - 23.75}{0.1/\sqrt{23}}\right) = P(Z \ge 1.82), \]

where Z has an N(0, 1) distribution. From Table B.1 we find $P(Z \ge 1.82) = 0.0344$.
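
Instead of a table lookup, the right tail probability of the standard normal distribution can be computed with the complementary error function from Python's `math` module. The following is a minimal sketch of the p-value computation; the helper name `normal_upper_tail` is ours. Because the code does not round z to two decimals first, its result differs slightly from the table value.

```python
import math

def normal_upper_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mu0, sigma, n, xbar = 23.75, 0.1, 23, 23.788
z = (xbar - mu0) / (sigma / math.sqrt(n))   # standardized sample mean, about 1.82

p_value = normal_upper_tail(z)              # about 0.034
p_table = normal_upper_tail(1.82)           # matches the table value 0.0344
print(round(z, 2), round(p_table, 4))  # 1.82 0.0344
```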

25.11 A type I error occurs when μ = 0 and $|t| \ge 2$. When μ = 0, then T has an
N(0, 1) distribution. Hence, by symmetry of the N(0, 1) distribution and Table B.1,
we find that the probability of committing a type I error is

\[ P(|T| \ge 2) = P(T \le -2) + P(T \ge 2) = 2 \cdot P(T \ge 2) = 2 \cdot 0.0228 = 0.0456. \]

26.5 a The p-value is $P(X \ge 15)$ under the null hypothesis $H_0 : p = 1/2$. Using
Table 26.3 we find $P(X \ge 15) = 1 - P(X \le 14) = 1 - 0.8950 = 0.1050$.

26.5 b Only values close to 23 are in favor of $H_1 : p > 1/2$, so the critical region is
of the form $K = \{c, c + 1, \ldots, 23\}$. The critical value c is the smallest value, such
that $P(X \ge c) \le 0.05$ under $H_0 : p = 1/2$, or equivalently, $1 - P(X \le c - 1) \le 0.05$,
which means $P(X \le c - 1) \ge 0.95$. From Table 26.3 we conclude that $c - 1 = 15$, so
that $K = \{16, 17, \ldots, 23\}$.

26.5 c A type I error occurs if p = 1/2 and $X \ge 16$. The probability that this
happens is $P(X \ge 16 \mid p = 1/2) = 1 - P(X \le 15 \mid p = 1/2) = 1 - 0.9534 = 0.0466$,
where we have used Table 26.3 once more.
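
The table lookups in parts a to c can be replaced by an exact computation with the Bin(23, 1/2) distribution. The sketch below uses only `math.comb`; the helper `binom_cdf` and the variable names are ours.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X with a Bin(n, p) distribution."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p0, level = 23, 0.5, 0.05

# a: p-value P(X >= 15) under H0
p_value = 1 - binom_cdf(14, n, p0)

# b: smallest c with P(X >= c) <= 0.05 under H0
c = next(c for c in range(n + 1) if 1 - binom_cdf(c - 1, n, p0) <= level)

# c: probability of a type I error, P(X >= c) under H0
type_one = 1 - binom_cdf(c - 1, n, p0)

print(round(p_value, 4), c, round(type_one, 4))  # 0.105 16 0.0466
```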

26.5 d In this case, a type II error occurs if p = 0.6 and $X \le 15$. To approximate
$P(X \le 15 \mid p = 0.6)$, we use the same reasoning as in Section 14.2, but now with
n = 23 and p = 0.6. Write X as the sum of independent Bernoulli random variables:
$X = R_1 + \cdots + R_n$, and apply the central limit theorem with μ = p = 0.6 and
$\sigma^2 = p(1 - p) = 0.24$. Then

\[ P(X \le 15) = P(R_1 + \cdots + R_n \le 15) = P\left(\frac{R_1 + \cdots + R_n - n\mu}{\sigma\sqrt{n}} \le \frac{15 - n\mu}{\sigma\sqrt{n}}\right) \]
\[ = P\left(Z_{23} \le \frac{15 - 13.8}{\sqrt{0.24}\sqrt{23}}\right) \approx \Phi(0.51) = 0.6950. \]

26.8 a Test statistic $T = \bar{X}_n$ takes values in $(0, \infty)$. Recall that the Exp(λ)
distribution has expectation 1/λ, and that according to the law of large numbers $\bar{X}_n$ will
be close to 1/λ. Hence, values of $\bar{X}_n$ close to 1 are in favor of $H_0 : \lambda = 1$, and only
values of $\bar{X}_n$ close to zero are in favor of $H_1 : \lambda > 1$. Large values of $\bar{X}_n$ also provide
evidence against $H_0 : \lambda = 1$, but even stronger evidence against $H_1 : \lambda > 1$. We
conclude that $T = \bar{X}_n$ has critical region $K = (0, c_l]$. This is an example in which
the alternative hypothesis and the test statistic deviate from the null hypothesis in
opposite directions.

Test statistic $T' = e^{-\bar{X}_n}$ takes values in (0, 1). Values of $\bar{X}_n$ close to zero correspond
to values of T' close to 1, and large values of $\bar{X}_n$ correspond to values of T' close
to 0. Hence, only values of T' close to 1 are in favor of $H_1 : \lambda > 1$. We conclude that T'
has critical region $K' = [c_u', 1)$. Here the alternative hypothesis and the test statistic
deviate from the null hypothesis in the same direction.

26.8 b Again, values of $\bar{X}_n$ close to 1 are in favor of $H_0 : \lambda = 1$. Values of $\bar{X}_n$ close
to zero suggest λ > 1, whereas large values of $\bar{X}_n$ suggest λ < 1. Hence, both small
and large values of $\bar{X}_n$ are in favor of $H_1 : \lambda \ne 1$. We conclude that $T = \bar{X}_n$ has
critical region $K = (0, c_l] \cup [c_u, \infty)$.

Small and large values of $\bar{X}_n$ correspond to values of T' close to 1 and 0. Hence,
values of T' both close to 0 and close to 1 are in favor of $H_1 : \lambda \ne 1$. We conclude that
T' has critical region $K' = (0, c_l'] \cup [c_u', 1)$. Both test statistics deviate from the null
hypothesis in the same directions as the alternative hypothesis.

26.9a Test statistic $T = (\bar{X}_n)^2$ takes values in $[0, \infty)$. Since μ is the expectation
of the N(μ, 1) distribution, according to the law of large numbers, $\bar{X}_n$ is close to μ.
Hence, values of $\bar{X}_n$ close to zero are in favor of $H_0 : \mu = 0$. Large negative values
of $\bar{X}_n$ suggest μ < 0, and large positive values of $\bar{X}_n$ suggest μ > 0. Therefore, both
large negative and large positive values of $\bar{X}_n$ are in favor of $H_1 : \mu \ne 0$. These
values correspond to large positive values of T, so T has critical region $K = [c_u, \infty)$.
This is an example in which the test statistic deviates from the null hypothesis in
one direction, whereas the alternative hypothesis deviates in two directions.

Test statistic T' takes values in $(-\infty, 0) \cup (0, \infty)$. Large negative values and large
positive values of $\bar{X}_n$ correspond to values of T' close to zero. Therefore, T' has
critical region $K' = [c_l', 0) \cup (0, c_u']$. This is an example in which the test statistic
deviates from the null hypothesis for small values, whereas the alternative hypothesis
deviates for large values.

26.9 b Only large positive values of $\bar{X}_n$ are in favor of μ > 0, which correspond to
large values of T. Hence, T has critical region $K = [c_u, \infty)$. This is an example where
the test statistic has the same type of critical region with a one-sided or two-sided
alternative. Of course, the critical value $c_u$ in part b is different from the one in
part a.

Large positive values of $\bar{X}_n$ correspond to small positive values of T'. Hence, T' has
critical region $K' = (0, c_u']$. This is another example where the test statistic deviates
from the null hypothesis for small values, whereas the alternative hypothesis deviates
for large values.

27.5 a The interest is whether the inbreeding coefficient exceeds 0. Let μ represent
this coefficient for the species of wasps. The value 0 is the a priori specified value
of the parameter, so test null hypothesis $H_0 : \mu = 0$. The alternative hypothesis
should express the belief that the inbreeding coefficient exceeds 0. Hence, we take
alternative hypothesis $H_1 : \mu > 0$. The value of the test statistic is

\[ t = \frac{\bar{x}_n}{0.884/\sqrt{197}} = 0.70. \]

27.5 b Because n = 197 is large, we approximate the distribution of T under the
null hypothesis by an N(0, 1) distribution. The value t = 0.70 lies to the right of
zero, so the p-value is the right tail probability $P(T \ge 0.70)$. By means of the normal
approximation we find from Table B.1 that the right tail probability

\[ P(T \ge 0.70) \approx 1 - \Phi(0.70) = 0.2420. \]

This means that the value of the test statistic is not very far in the (right) tail of
the distribution and is therefore not to be considered exceptionally large. We do not
reject the null hypothesis.

27.7 a The data are modeled by a simple linear regression model: $Y_i = \alpha + \beta x_i$,
where $Y_i$ is the gas consumption and $x_i$ is the average outside temperature in the ith
week. Higher gas consumption as a consequence of lower temperatures corresponds
to β < 0. It is natural to consider the value 0 as the a priori specified value of the
parameter (it corresponds to no change of gas consumption). Therefore, we take null
hypothesis $H_0 : \beta = 0$. The alternative hypothesis should express the belief that the
gas consumption increases as a consequence of lower temperatures. Hence, we take
alternative hypothesis $H_1 : \beta < 0$. The value of the test statistic is

\[ t_b = \frac{\hat\beta}{s_b} = \frac{-0.3932}{0.0196} = -20.06. \]

The test statistic $T_b$ has a t-distribution with n - 2 = 24 degrees of freedom. The
value -20.06 is smaller than the left critical value $-t_{24,0.05} = -1.711$, so we reject.

27.7 b For the data after insulation, the value of the test statistic is

\[ t_b = \frac{-0.2779}{0.0252} = -11.03, \]

and $T_b$ has a t(28) distribution. The value -11.03 is smaller than the left critical
value $-t_{28,0.05} = -1.701$, so we reject.

28.5 a When $aS_X^2 + bS_Y^2$ is unbiased for $\sigma^2$, we should have $E[aS_X^2 + bS_Y^2] = \sigma^2$.
Using that $S_X^2$ and $S_Y^2$ are both unbiased for $\sigma^2$, i.e., $E[S_X^2] = \sigma^2$ and $E[S_Y^2] = \sigma^2$,
we get

\[ E[aS_X^2 + bS_Y^2] = aE[S_X^2] + bE[S_Y^2] = (a + b)\sigma^2. \]

Hence, $E[aS_X^2 + bS_Y^2] = \sigma^2$ for all σ > 0 if and only if a + b = 1.

28.5 b By independence of $S_X^2$ and $S_Y^2$ write

\[ \mathrm{Var}(aS_X^2 + (1 - a)S_Y^2) = a^2\mathrm{Var}(S_X^2) + (1 - a)^2\mathrm{Var}(S_Y^2) = \left(\frac{a^2}{n - 1} + \frac{(1 - a)^2}{m - 1}\right)2\sigma^4. \]

To find the value of a that minimizes this, differentiate with respect to a and put
the derivative equal to zero. This leads to

\[ \frac{2a}{n - 1} - \frac{2(1 - a)}{m - 1} = 0. \]

Solving for a yields $a = (n - 1)/(n + m - 2)$. Note that the second derivative of
$\mathrm{Var}(aS_X^2 + (1 - a)S_Y^2)$ is positive so that this is indeed a minimum.